Subject: Genomics & Computational Biology Newsletter for Elman
Several recent studies have explored the application of computational methods, including foundation models (FMs) and large language models (LLMs), to complex biological problems, revealing both promising advances and persistent challenges. Xu et al. (2024) cast doubt on the universal efficacy of the FM paradigm, demonstrating that in specialized domains like genomics, satellite imaging, and time series analysis, carefully tuned supervised models can outperform state-of-the-art FMs. This underscores the importance of rigorous benchmarking and suggests that the benefits of large-scale pretraining observed in vision and language tasks have yet to fully translate to other domains. Complementing this, Celikkanat et al. (2024) revisited k-mer profiles for genome representation learning, proposing a lightweight model for metagenomic binning that rivals the performance of more complex genome foundation models while offering superior scalability. This suggests that simpler, more interpretable methods may still hold significant value in specific genomic tasks.
The application of LLMs to genomic analysis is also gaining traction. Wang et al. (2024) explored the potential of LLMs for analyzing the transcriptional regulation of long non-coding RNAs (lncRNAs), finding that fine-tuned genome foundation models show promise in progressively complex tasks. Their work highlights the importance of considering task complexity, model selection, data quality, and biological interpretability when applying LLMs to this challenging area. Simultaneously, Boulaimen et al. (2024) investigated the integration of LLMs like GPN-MSA, ESM1b, and AlphaMissense for genetic variant classification, demonstrating improved performance, particularly for Variants of Uncertain Significance (VUS), on benchmark datasets like ProteinGym and ClinVar. This suggests that LLMs can enhance the accuracy and reliability of variant classification, potentially impacting clinical diagnostics.
Beyond LLMs, novel computational approaches are being developed for specific genomic problems. Young and Gilles (2024) introduced a 3D chaos game representation for quantifying DNA sequence similarity, offering an alignment-free method for phylogenetic analysis. Their approach, based on shape-similarity comparison techniques, performed comparably to alignment-based methods in constructing phylogenetic trees, suggesting its potential for analyzing diverse DNA sequences. Meanwhile, Sena and Tomescu (2024) addressed the computational challenges of RNA transcript assembly by leveraging safe paths and sequences within Integer Linear Programming (ILP) solvers. This optimization significantly improved the scalability of ILP-based methods for RNA transcript assembly, potentially enabling more accurate and efficient analysis of complex transcriptomic data.
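As a rough illustration of the chaos game idea (not the authors' exact construction), the sketch below maps a DNA sequence to a 3D point cloud by assigning each nucleotide to a vertex of a regular tetrahedron and repeatedly moving halfway toward the vertex of the next base; the vertex coordinates and any downstream shape-similarity metric are assumptions for illustration only.

```python
import numpy as np

# Vertices of a regular tetrahedron, one per nucleotide (an assumed convention;
# the paper's exact 3D construction and similarity metric may differ).
VERTICES = {
    "A": np.array([1.0, 1.0, 1.0]),
    "C": np.array([1.0, -1.0, -1.0]),
    "G": np.array([-1.0, 1.0, -1.0]),
    "T": np.array([-1.0, -1.0, 1.0]),
}

def chaos_game_3d(seq: str) -> np.ndarray:
    """Map a DNA sequence to a 3D point cloud: each step moves halfway from the
    current point toward the vertex of the next nucleotide."""
    points = np.zeros((len(seq), 3))
    current = np.zeros(3)
    for i, base in enumerate(seq.upper()):
        current = (current + VERTICES[base]) / 2.0
        points[i] = current
    return points

cloud = chaos_game_3d("ACGTGCGTAACGT")
print(cloud.shape)  # (13, 3): one point per nucleotide, ready for shape comparison
```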
Moving beyond purely computational genomics, Meilander et al. (2024) investigated the microbial dynamics of human excrement composting (HEC), revealing a transition from a gut-like microbiome to a soil-like microbiome over time. This work highlights the potential of HEC as a sustainable waste management strategy and provides valuable insights into the complex microbial interactions involved in this process. Finally, Zeiltinger et al. (2024) provided a perspective on recent developments and challenges in regulatory and systems genomics, emphasizing the importance of understanding the cis-regulatory code. They discussed the use of sequence-to-function neural networks, the role of 3D chromatin organization, and the potential of emerging technologies like spatial transcriptomics in deciphering the complex interplay between genetic variation and phenotypic outcomes.
Integrating Large Language Models for Genetic Variant Classification by Youssef Boulaimen, Gabriele Fossi, Leila Outemzabet, Nathalie Jeanray, Oleksandr Levenets, Stephane Gerart, Sebastien Vachenc, Salvatore Raieli, Joanna Giemza https://arxiv.org/abs/2411.05055
Caption: This table presents the performance of different machine learning models trained on various combinations of features derived from Large Language Models (LLMs) for genetic variant classification. The results demonstrate the superior performance of the integrated approach, with the multi-input neural network trained on combined observed and potential scores from GPN, ESM, and AlphaMissense achieving the highest accuracy, F1-score, and ROC-AUC. This highlights the power of integrating diverse LLM-derived features for enhanced variant pathogenicity prediction.
Classifying genetic variants, especially Variants of Uncertain Significance (VUS), is a significant challenge in precision medicine. This research leverages the power of Large Language Models (LLMs) – specifically GPN-MSA, ESM1b, and AlphaMissense – to improve the accuracy of variant pathogenicity prediction. The core hypothesis is that integrating these models, each offering a unique perspective based on DNA sequence, protein sequence, and structural information, respectively, will lead to more robust and accurate classifications compared to using individual models.
The study employed the ProteinGym and ClinVar datasets for model evaluation. Four different feature sets were constructed from the potential and observed scores produced by the LLMs. These feature sets were then used to train various machine learning models, including XGBoost, Random Forest, and single- and multi-input neural networks, with the DMS_bin_score from ProteinGym serving as the target variable. A correlation analysis revealed a positive correlation between GPN-MSA and ESM1b scores and negative correlations between AlphaMissense and the other two, suggesting that each model captures distinct aspects of variant impact.
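As a loose sketch of the integrated approach (not the authors' code), the snippet below feeds the observed and potential scores from each of the three LLMs into separate branches of a small multi-input network; the layer sizes, feature ordering, and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiInputVariantClassifier(nn.Module):
    """Toy multi-input network: one branch per LLM score pair, fused into a
    single binary pathogenicity prediction. Layer sizes are illustrative only."""
    def __init__(self, per_branch_features: int = 2, hidden: int = 16):
        super().__init__()
        # One small branch each for GPN-MSA, ESM1b, and AlphaMissense scores
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(per_branch_features, hidden), nn.ReLU())
            for _ in range(3)
        )
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, gpn, esm, am):
        fused = torch.cat([b(x) for b, x in zip(self.branches, (gpn, esm, am))], dim=-1)
        return self.head(fused)  # raw logit; apply sigmoid for a probability

# Each input holds (observed_score, potential_score) for one variant (toy data here)
model = MultiInputVariantClassifier()
gpn, esm, am = torch.randn(4, 2), torch.randn(4, 2), torch.randn(4, 2)
logits = model(gpn, esm, am)
loss = nn.BCEWithLogitsLoss()(logits.squeeze(-1), torch.randint(0, 2, (4,)).float())
```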
The results strongly support the integrated approach. The multi-input neural network, trained on the combined observed and potential scores from all three LLMs, achieved the best overall performance. On the ProteinGym dataset, it achieved accuracy and ROC-AUC exceeding 0.89. Importantly, this integrated model outperformed the individual LLMs on the challenging ProteinGym test set, which was specifically designed to include ambiguous variants. It achieved 82.54% accuracy compared to 74.58% for AlphaMissense, 73.84% for ESM1b, and 67.03% for GPN-MSA. Similar improvements were observed on the ClinVar dataset.
Furthermore, the integrated model demonstrated its strength in classifying variants deemed ambiguous by AlphaMissense and/or ClinVar, highlighting its robustness and potential for resolving uncertain classifications in real-world clinical settings. Case studies provided further validation. For instance, the integrated model correctly predicted the pathogenicity of the E563Q mutation in LZTR1, aligning with the experimental DMS_bin_score, while other models misclassified it as benign. Structural analysis using PyMOL corroborated the integrated model's prediction, revealing a potential alteration in protein function due to the mutation.
Revisiting K-mer Profile for Effective and Scalable Genome Representation Learning by Abdulkadir Celikkanat, Andres R. Masegosa, Thomas D. Nielsen https://arxiv.org/abs/2411.02125
Caption: This bar chart compares the number of parameters (on a log10 scale) for various models used in metagenomic binning. It highlights the significantly smaller size of the proposed k-mer based models (OURS (POIS) and OURS (NL)) compared to the larger genome foundation models (HYENADNA, DNABERT-2, and DNABERT-S), emphasizing their lightweight and scalable nature. This reduced complexity makes k-mer models attractive alternatives for large-scale genomic analysis.
While genome foundation models, inspired by LLMs, have shown promise in generating powerful genome representations, their computational demands limit their applicability to large datasets. This study revisits the simpler k-mer-based representations, offering a more scalable and lightweight alternative for tasks like metagenomic binning.
The authors begin with a theoretical analysis of k-mer spaces, demonstrating that under specific conditions, DNA fragments are identifiable based on their k-mer profiles. For fragments that are not perfectly identifiable, they establish bounds on the edit distance using the l₁ distance between their k-mer profiles: $\alpha_l d(r, q) \le ||c_r - c_q||_1 \le \alpha_u d(r, q)$, where $\alpha_l = 1/l$ and $\alpha_u = k|\Sigma|^k$. This theoretical framework provides a solid justification for the use of k-mer-based representations and suggests potential applications beyond metagenomic binning.
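To make the k-mer profile notation concrete, here is a minimal Python sketch (not taken from the paper) that builds the count vectors c_r and c_q over all |Σ|^k possible k-mers and computes the l₁ distance that the bound above relates to the edit distance.

```python
from itertools import product

def kmer_profile(seq: str, k: int = 3, alphabet: str = "ACGT") -> list[int]:
    """Count vector over all |alphabet|^k possible k-mers, in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:          # skip k-mers containing ambiguous bases such as 'N'
            counts[index[kmer]] += 1
    return counts

def l1_distance(c_r: list[int], c_q: list[int]) -> int:
    """||c_r - c_q||_1, the quantity bounded by the edit distance in the text."""
    return sum(abs(a - b) for a, b in zip(c_r, c_q))

r, q = "ACGTACGTAC", "ACGTTCGTAC"   # toy fragments differing by one substitution
print(l1_distance(kmer_profile(r), kmer_profile(q)))
```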
Based on this theoretical foundation, the paper proposes three models for learning embeddings of genome fragments using their k-mer representations. Two linear models – a k-mer profile model and a Poisson model that captures dependencies between k-mers using $\lambda_{x,y} = \exp(-||z_x - z_y||^2)$ – are introduced. Additionally, a non-linear model is proposed, utilizing a shallow neural network architecture with self-supervised contrastive learning. This non-linear model learns embeddings by contrasting segments from the same read (positive pairs) with segments from different reads (negative pairs), using a loss function defined as $L_{NL} = -\sum_{(i,j)\in I} \left[ y_{ij} \log p_{ij} + (1-y_{ij})\log(1-p_{ij}) \right]$, where $p_{ij} = \exp(-||E_{NL}(r_i) - E_{NL}(r_j)||^2)$.
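A minimal sketch of that contrastive objective is shown below, assuming a placeholder shallow encoder over k-mer profiles; the encoder architecture, pair construction, and dimensions are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class KmerEncoder(nn.Module):
    """Placeholder shallow encoder mapping a k-mer profile to an embedding."""
    def __init__(self, n_kmers: int = 64, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_kmers, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z_i, z_j, y):
    """L = -sum[ y*log p + (1-y)*log(1-p) ], with p = exp(-||z_i - z_j||^2).
    y = 1 for segments of the same read (positive pairs), 0 otherwise."""
    p = torch.exp(-((z_i - z_j) ** 2).sum(dim=-1)).clamp(1e-7, 1 - 1e-7)  # clamp for numerical safety
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()

enc = KmerEncoder()
x_i, x_j = torch.rand(8, 64), torch.rand(8, 64)   # k-mer profiles of paired segments
y = torch.randint(0, 2, (8,)).float()             # same-read (1) vs different-read (0)
loss = contrastive_loss(enc(x_i), enc(x_j), y)
loss.backward()
```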
The proposed models were evaluated on metagenomic binning tasks using both CAMI2 challenge datasets and synthetic data. Their performance was compared to state-of-the-art genome foundation models, including HYENADNA, DNABERT-2, and DNABERT-S. Remarkably, the proposed k-mer-based models achieved comparable quality in terms of recovered Metagenome-Assembled Genomes (MAGs) while requiring significantly fewer computational resources. In particular, the non-linear k-mer model demonstrated competitive performance in identifying high-quality bins, matching or exceeding the performance of DNABERT-S on several datasets. An ablation study further confirmed the importance of the k-mer length (k) and embedding dimension in optimizing model performance.
Exploring the Potentials and Challenges of Using Large Language Models for the Analysis of Transcriptional Regulation of Long Non-coding RNAs by Wei Wang, Zhichao Hou, Xiaorui Liu, Xinxia Peng https://arxiv.org/abs/2411.03522
Caption: This figure illustrates the four progressively complex tasks used to evaluate the performance of large language models (LLMs) in understanding lncRNA regulation. The tasks range from easy (biological vs. artificial sequence classification) to hard (coding vs. non-coding promoter classification), reflecting the increasing difficulty in predicting lncRNA gene expression based on sequence information.
Long non-coding RNAs (lncRNAs) play critical roles in various biological processes, but their functional mechanisms and transcriptional regulation remain largely elusive. This study investigates the potential of Large Language Models (LLMs) to shed light on these complex regulatory processes through sequence analysis. The researchers fine-tuned three genome foundation models – DNABERT, DNABERT-2, and Nucleotide Transformer – on four progressively complex tasks related to lncRNA gene expression: distinguishing biological from artificial sequences, identifying promoter sequences, classifying promoters of highly vs. lowly expressed genes, and classifying promoters of protein-coding vs. lncRNA genes.
The methodology involved fine-tuning the pre-trained LLMs on datasets curated for each task. Standard fine-tuning was used for DNABERT and DNABERT-2, while parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) was employed for the larger Nucleotide Transformer. Model performance was evaluated using accuracy, F1-score, and Matthews Correlation Coefficient (MCC). A baseline model based on Logistic Regression with n-gram TF-IDF features was also implemented for comparison. Feature importance analysis, leveraging attention scores, was conducted to pinpoint the regions within promoter sequences that most influence predictions.
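For readers curious what parameter-efficient fine-tuning with LoRA looks like in practice, the sketch below uses the Hugging Face peft library; the checkpoint name, LoRA hyperparameters, and target module names are illustrative assumptions, not the configuration reported in the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Assumed checkpoint; substitute the Nucleotide Transformer variant actually used.
checkpoint = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Low-Rank Adaptation: train small rank-r update matrices instead of all weights.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                      # rank of the low-rank update (illustrative)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt (assumed names)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# Tokenized promoter sequences with binary labels can then be passed to a
# standard transformers Trainer for fine-tuning on each classification task.
```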
The results demonstrate that fine-tuned LLMs outperformed the baseline model, particularly on more complex tasks that required contextual understanding, such as distinguishing promoters of highly vs. lowly expressed genes. For instance, DNABERT (3-mer) achieved the highest MCC of 73.48% on this task. However, for simpler tasks like distinguishing biological from artificial sequences, traditional machine learning methods were almost as effective. Interestingly, the most challenging task – directly classifying protein-coding vs. lncRNA gene promoters – saw a significant drop in LLM performance, suggesting the involvement of factors beyond promoter sequences.
The study highlights the importance of considering task complexity and model selection when applying LLMs to biological sequence analysis. Data quality also emerged as a crucial factor, with artificially generated data potentially leading to inflated performance estimates. Promoter sequence length also influenced accuracy, with shorter sequences generally yielding better results. Feature importance analysis revealed that the initial 80bp upstream of transcription start sites (TSSs) contributed most significantly to gene expression prediction, pointing to the presence of key regulatory elements in this region.
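As a rough sketch of how such attention-based feature importance can be extracted (not the authors' pipeline), one can request per-layer attention matrices from a fine-tuned BERT-style genome model and aggregate the attention each position receives; the checkpoint path below is a placeholder, and the 3-mer tokenization mirrors DNABERT's input format.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path; in practice this would be the fine-tuned promoter classifier.
checkpoint = "path/to/fine-tuned-dnabert-promoter-model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, output_attentions=True)

promoter = "ACGT" * 25                                     # toy 100 bp promoter fragment
kmers = " ".join(promoter[i:i + 3] for i in range(len(promoter) - 2))  # DNABERT-style 3-mers
inputs = tokenizer(kmers, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
# Average over layers and heads, then sum the attention each token receives.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0]     # (seq_len, seq_len)
importance = att.sum(dim=0)                                # attention received per token
print(importance.topk(5).indices)                          # most-attended token positions
```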
This newsletter showcases the evolving landscape of computational genomics. While the dominance of foundation models is being challenged in specialized areas like genomics, alternative approaches are emerging. The work by Celikkanat et al. demonstrates that simpler, more interpretable k-mer based models can achieve comparable performance to complex foundation models in metagenomic binning, with the added benefit of improved scalability. Simultaneously, the potential of LLMs is becoming increasingly apparent. Boulaimen et al.'s research on integrating LLMs for genetic variant classification highlights the power of combining diverse data sources and modeling approaches to improve prediction accuracy, particularly for challenging VUS classifications. Wang et al.'s exploration of LLMs for lncRNA regulation analysis further underscores the promise of these models in deciphering complex biological processes, although careful consideration of task complexity, data quality, and interpretability is crucial. Collectively, these studies highlight the ongoing search for the right balance between model complexity and performance in tackling specific genomic challenges, paving the way for more accurate, efficient, and insightful biological data analysis.