Using newly generated datasets, published temporal datasets and Arabidopsis genomes, we trained ML models to make predictions about circadian gene regulation and expression patterns. Our ML models classify circadian expression patterns using progressively fewer transcriptomic timepoints, with improved accuracy compared to existing state-of-the-art models.
We also introduced model interpretation to identify the most informative transcriptomic timepoints for sampling. We believe that predictive insight on when to sample will be a valuable reference for experimental biologists when planning experiments. Next, we redefined the field by developing ML models that distinguish circadian transcripts without using transcriptomic timepoint information, relying instead on DNA sequence features generated from public genomic resources. This allows us to predict the circadian regulation of genes simply by analyzing the genome sequence.
Our decision to do this was based on the theory that a major mechanism of gene expression control, circadian or otherwise, is through transcription factors (and other factors) that bind to regulatory DNA sequences. Transcription factors are vital molecules that control gene expression, directing when, where and to what degree genes are expressed. They bind to specific sequences of DNA and control the transcription of DNA into mRNA.
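To make the idea of DNA sequence features concrete, here is a minimal sketch that turns a stretch of sequence into counts of short subsequences (k-mers). This is an assumption for illustration: the text above does not say which sequence features the models use, and the function names and the example sequence are invented.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def kmer_counts(sequence: str, k: int = 3) -> Dict[str, int]:
    """Count every overlapping k-mer in a DNA sequence.

    Windows containing anything other than A, C, G or T (for example N)
    are skipped.
    """
    sequence = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return dict(counts)


def feature_vector(sequence: str, k: int = 3) -> List[int]:
    """Fixed-length vector with one count per possible k-mer, in sorted order."""
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Hypothetical regulatory fragment, invented for illustration only.
fragment = "AAAATATCTAAAATATCT"
print(kmer_counts(fragment, k=9).get("AAAATATCT", 0))  # 2
print(len(feature_vector(fragment, k=3)))               # 64
```

Every sequence maps to a vector of the same length (4^k entries), so regions from different genes can be fed to the same classifier regardless of how long they are.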
What makes our models more informative is our use of explainable AI algorithms. We used the interpretation of our ML models to illuminate what’s inside the “black box,” so that we can better understand the predictions they make. We used a transcript-specific local model explanation to rank DNA sequence features, which provides a detailed profile of the potential circadian regulatory mechanisms for each transcript.
Using the local explanation derived from ranked DNA sequence features, we can distinguish the temporal phase of transcript expression and, in doing so, reveal hidden sub-classes within the circadian class, e.g., whether a transcript is likely to show its peak expression in the day or at night.
Model interpretation and explanation provide the backbone of our methodological advances because they give insight into biological processes and experimental design. This approach can optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints.
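As a rough illustration of what a transcript-specific ranking looks like, the sketch below uses a simple linear model as a stand-in. The text does not name the explanation algorithm, so this is an assumption, not the method itself. All feature names, weights and values are hypothetical.

```python
from typing import Dict, List, Tuple


def local_explanation(
    weights: Dict[str, float],
    features: Dict[str, float],
    baseline: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank features by their contribution to one transcript's score.

    For a linear model, a feature's contribution is its weight times the
    distance between this transcript's value and a baseline value
    (for example the average over all transcripts).
    """
    contributions = {
        name: weights[name] * (features.get(name, 0.0) - baseline.get(name, 0.0))
        for name in weights
    }
    # Largest absolute contribution first; the sign shows the direction.
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical model weights and feature values, invented for illustration.
weights = {"motif_1": 2.0, "motif_2": -1.5, "motif_3": 0.5}
baseline = {"motif_1": 0.5, "motif_2": 1.0, "motif_3": 1.0}
transcript_a = {"motif_1": 2.0, "motif_2": 1.0, "motif_3": 0.0}
transcript_b = {"motif_1": 0.0, "motif_2": 3.0, "motif_3": 1.0}

for label, transcript in [("transcript_a", transcript_a), ("transcript_b", transcript_b)]:
    print(label, local_explanation(weights, transcript, baseline))
```

The same model yields a different ranking for each transcript: motif_1 leads for transcript_a, motif_2 for transcript_b. That per-transcript view is what separates a local explanation from a single global list of important features.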
Finally, our models predict the circadian time from a single transcriptomic timepoint. This identifies novel marker transcripts that are most impactful for accurate predictions, which could facilitate the identification of altered circadian clock function from existing datasets. These applications of explainable AI could redefine how we reuse public data and how we generate testable hypotheses to understand gene expression control.
Our research describes a series of AI- and ML-based approaches that have the potential to enable more cost-effective analysis and insight into circadian regulation and function. While we initially worked with the model plant Arabidopsis, where extensive genome resources allow experimental validation of findings, this approach has widespread implications for other complex or temporal gene expression patterns, as well as other Arabidopsis ecotypes—some of which we have tested already. Furthermore, in our published work we adapted our ML approach for wheat to show that our methods allow accurate analysis of key food crops.
Our ML models and their application in crops, where circadian rhythms are critical to maintaining healthy growth and development, could potentially lead to increased yields as agricultural scientists and farmers begin to use the model to understand the inner rhythms of the plants they grow and harvest.
But the technology we developed goes beyond the scope of plants. We are now looking at different species to investigate the circadian clock and its link to disease, for example in humans, where dysregulation of the circadian clock has been associated with a range of diseases from depression to cancer.
Gardiner, L., Rusholme-Pilcher, R., Colmer, J., et al. Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function. PNAS. August 10, 2021 118 (32) e2103070118