Recent research has explored novel computational approaches to analyze genomic and transcriptomic data, revealing intricate relationships between structure and function in biological systems. For instance, Shi et al. (2025) developed a theoretical framework to predict chromatin dynamics from static Hi-C or imaging data. Their model, based on calculated 3D chromatin structures, accurately forecasts two-point dynamics, uncovering a subdiffusive scaling relationship between relaxation time and genomic separation. This finding challenges traditional polymer models and fractal globule descriptions of chromosome organization, suggesting that static 3D structure dictates dynamic interactions, particularly between enhancers and promoters. Furthermore, their model predicts faster dynamics upon cohesin depletion, offering testable predictions for future experiments.
Moving from structural dynamics to gene expression analysis, Villatoro-García et al. (2025) introduced GSEMA, a novel gene set enrichment meta-analysis method. Addressing the challenges of missing genes and cross-platform discrepancies in transcriptomic data, GSEMA aggregates gene expression into pathway-level matrices using single-sample enrichment scoring. This approach preserves effect size and directionality, enabling the definition of pathway activity across datasets. Their validation on simulated and real datasets (SLE and PD) demonstrates improved control of false positives and enhanced biological interpretability compared to traditional gene-level meta-analysis.
Focusing on specific phenotypic traits, Younuskunju et al. (2025) investigated the genetic basis of date palm fruit size. Integrating genomic and phenotypic data, they identified significant loci associated with fruit length, width, area, and weight. Candidate genes near these loci, implicated in cell differentiation, proliferation, and growth regulation pathways (e.g., auxin and abscisic acid signaling), showed high expression during early fruit development, coinciding with maximal size and weight attainment. This work provides valuable insights into the genetic determinants of fruit size, crucial for crop improvement.
Several studies also presented new bioinformatics tools for analyzing metagenomic data. Aroney et al. (2025) introduced CoverM, a software package for calculating read coverage statistics. Leveraging 'Mosdepth arrays' for efficiency, CoverM streams read alignment results to calculate various coverage metrics for contigs and genomes, offering a unified and flexible approach for metagenomic analysis. Meanwhile, Qu et al. (2025) developed GiantHunter, a reinforcement learning tool for detecting giant viruses in metagenomic data. Utilizing a Monte Carlo tree search strategy, GiantHunter dynamically selects negative training data, improving the precision and efficiency of giant virus identification. Application of GiantHunter to metagenomic datasets from the Yangtze River revealed dam-associated shifts in viral diversity, demonstrating the tool's potential for ecological studies.
Finally, He et al. (2025) proposed a hypergraph representation of scRNA-seq data to overcome limitations of traditional network-based analyses. Their approach captures higher-order relationships between cells and genes, mitigating information loss and zero-inflation issues. They introduce two novel clustering methods, DIPHW and CoMem-DIPHW, which leverage hypergraph walks and integrate coexpression information. These methods demonstrate superior performance, particularly in datasets with weak modularity, offering a promising new avenue for scRNA-seq analysis. Collectively, these studies showcase the power of innovative computational approaches to unravel complex biological phenomena, from chromatin dynamics to metagenomic diversity.
Static Three-Dimensional Structures Determine Fast Dynamics Between Distal Loci Pairs in Interphase Chromosomes by Guang Shi, Sucheol Shin, D. Thirumalai https://arxiv.org/abs/2501.10004
Caption: The figure shows the application of a new theoretical framework to predict chromatin dynamics from static 3D chromosome structures derived from Hi-C data. Panel (a) displays a Hi-C contact map, while (b) shows distance distributions. Panels (c) and (e) compare predicted and experimental scaling of mean distance and relaxation time with genomic separation, respectively. Panel (d) depicts the mean square displacement over time, and (f) shows the scaling of relaxation time with genomic separation from simulations. These results demonstrate the theory's ability to accurately predict experimentally observed fast transcriptional dynamics in Drosophila.
A recent study revealed surprisingly fast dynamics between distal enhancer and promoter regions in Drosophila chromosomes, despite the compact overall structure. This observation challenges the traditional understanding of the structure-function relationship and existing polymer models like the fractal globule and Rouse models, which predict significantly slower dynamics. These models typically relate the relaxation time, τ, between two loci separated by genomic distance s as τ ~ s<sup>γ</sup>. The fractal globule model predicts γ = 5/3, and the Rouse model predicts γ = 2, but experiments have found γ ≈ 0.8. This discrepancy, termed a "conundrum," prompted the development of a new theoretical framework to reconcile these findings.
The new theory proposes that the static 3D chromosome structure holds sufficient information to predict its dynamic behavior. This theory involves a two-step process. First, the 3D structure is calculated from experimental Hi-C contact maps using the HIPPS/DIMES method, which employs a maximum entropy principle to generate an ensemble of 3D structures consistent with the contact map. The mean distances, <r<sub>ij</sub>>, between loci i and j are related to contact probabilities, p<sub>ij</sub>, by <r<sub>ij</sub>> = (p<sub>ij</sub>)<sup>-1/α</sup>, with α ≈ 4. Second, the dynamics are calculated by interpreting the Lagrange multipliers, k<sub>ij</sub> (obtained during the 3D structure calculation), as spring constants in a harmonic potential within a chromatin network. This allows the calculation of dynamical correlation functions using standard polymer dynamics theory.
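To make the first step concrete, the sketch below converts a contact probability matrix into a mean distance matrix using the power-law relation above. This is a minimal illustration under stated assumptions, not the HIPPS/DIMES implementation; the toy matrix values and the pseudocount handling are assumptions.

```python
import numpy as np

def contact_to_distance(p, alpha=4.0, eps=1e-6):
    """Convert a Hi-C contact probability matrix into mean pairwise
    distances via <r_ij> = p_ij^(-1/alpha) (step one of the framework).

    eps is a small floor guarding against zero contacts (an assumption;
    the published method may handle missing contacts differently).
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return p ** (-1.0 / alpha)

# Toy 4-locus contact map: nearby loci touch more often.
p = np.array([[1.00, 0.50, 0.10, 0.05],
              [0.50, 1.00, 0.50, 0.10],
              [0.10, 0.50, 1.00, 0.50],
              [0.05, 0.10, 0.50, 1.00]])
r = contact_to_distance(p)
print(r)  # distances grow as contact probability falls
```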
The theory was validated by accurately reproducing the established scaling relations for the Rouse chain, self-avoiding walk, and fractal globule models. When applied to experimental Micro-C data from Drosophila embryos, the theory successfully predicted the observed fast transcriptional dynamics, with a scaling exponent γ ≈ 0.7, remarkably close to the experimental value of 0.8. The theory also accurately predicted the two-point mean square displacement, M<sub>2</sub>(t), and the mean first passage time for contact, τ<sub>c</sub>, which scales with the mean spatial distance, <r>, as τ<sub>c</sub> ~ <r><sup>3.4</sup>. This scaling is consistent with calculations based on experimental trajectory data and qualitatively agrees with predictions from the Szabo, Schulten, and Schulten (SSS) theory for contact formation.
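To illustrate how such exponents are extracted, here is a minimal log-log fit, with synthetic values standing in for trajectory data (the numbers below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical data: mean spatial distances <r> and first-passage
# times tau_c for several locus pairs, generated to follow
# tau_c ~ <r>^3.4 for demonstration purposes.
r_mean = np.array([0.3, 0.5, 0.8, 1.2, 1.8])
tau_c = 2.0 * r_mean ** 3.4

# The slope of a log-log regression recovers the scaling exponent.
slope, intercept = np.polyfit(np.log(r_mean), np.log(tau_c), 1)
print(f"estimated exponent: {slope:.2f}")  # ~3.40
```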
Moreover, the theory revealed that single-locus diffusion is heterogeneous and anti-correlated with local chromatin density, as quantified by a centrality measure. The theory also predicts that cohesin depletion increases diffusivity and alters two-point relaxation times in a locus-dependent manner. These predictions can be tested experimentally. This study demonstrates that static 3D chromosome structure, when analyzed with the appropriate theoretical framework, can accurately predict the dynamic behavior of chromatin, resolving the apparent disconnect between structure and function observed in recent experiments. This generalizable framework opens new possibilities for investigating chromatin dynamics across various species and contributes to a deeper understanding of genome organization and gene regulation.
Hypergraph Representations of scRNA-seq Data for Improved Clustering with Random Walks by Wan He, Daniel I. Bolnick, Samuel V. Scarpino, Tina Eliassi-Rad https://arxiv.org/abs/2501.11760
Caption: This figure visually summarizes the CoMem-DIPHW pipeline for single-cell RNA sequencing analysis. It depicts the conversion of scRNA-seq data into hypergraph and network representations, followed by the dual-importance preference hypergraph walk process and embedding generation using a neural network. The illustration highlights the steps involved in generating cell embeddings from the hypergraph structure, including the random walk process across different hyperedges (genes) and vertices (cells).
Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, but traditional analysis methods often rely on network projections, such as coexpression networks, which can lead to information loss and spurious correlations due to data sparsity. This paper introduces two novel hypergraph-based clustering methods, Dual-Importance Preference Hypergraph Walk (DIPHW) and Coexpression and Memory-Integrated DIPHW (CoMem-DIPHW), to address these limitations and enhance cell clustering accuracy.
The central concept is representing scRNA-seq data as a hypergraph, where cells are nodes and genes are hyperedges. Each hyperedge connects to the cells expressing that gene, weighted by the expression level. DIPHW performs a random walk on this hypergraph, considering both the relative importance of genes to cells (P(E|V)) and cells to genes (P(V|E)), along with a preference exponent (α) to accelerate convergence. CoMem-DIPHW expands upon this by incorporating a memory mechanism that leverages cell and gene coexpression networks (G<sub>V</sub> and G<sub>E</sub> respectively) to influence transition probabilities based on previously visited nodes and edges, capturing both local and global information. The node-to-node transition probability in DIPHW is given by:
P(u → v) = Σ<sub>e∈E(u)</sub> [w(e)γ<sub>e</sub>(u) / Σ<sub>e'∈E(u)</sub> w(e')γ<sub>e'</sub>(u)] * [γ<sub>e</sub>(v) / Σ<sub>u'∈e</sub> γ<sub>e</sub>(u')]
where w(e) is the weight of hyperedge e, and γ<sub>e</sub>(u) is the weight of node u in hyperedge e.
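To make the walk concrete, below is a minimal sketch of one DIPHW step on a toy cell-by-gene matrix, omitting the preference exponent α and the CoMem memory mechanism. Taking the hyperedge weight w(e) as total expression of gene e is an assumption here, not necessarily the authors' definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: rows = cells (vertices), columns = genes
# (hyperedges); entry W[u, e] is gamma_e(u), the weight of cell u
# in hyperedge e.
W = np.array([[3.0, 0.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.0, 4.0, 2.0]])
w_edge = W.sum(axis=0)  # hyperedge weight w(e); assumed definition

def diphw_step(u):
    """One node-to-node step of the hypergraph walk from cell u:
    pick an incident gene e with probability ~ w(e) * gamma_e(u),
    then pick a cell v within e with probability ~ gamma_e(v)."""
    edge_scores = w_edge * W[u]          # zero where u is not in e
    e = rng.choice(W.shape[1], p=edge_scores / edge_scores.sum())
    node_scores = W[:, e]
    v = rng.choice(W.shape[0], p=node_scores / node_scores.sum())
    return v

u = 0
for _ in range(5):
    u = diphw_step(u)
print("walk ended at cell", u)
```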
These methods were evaluated on simulated and real scRNA-seq datasets, comparing them to established methods like PCA, Node2Vec, and Louvain. On simulated data, DIPHW and CoMem-DIPHW consistently outperformed other methods, particularly in scenarios with weak modularity. The improvement was especially pronounced in cases with low numbers of co-expressed genes per module or high numbers of embedded modules. In empirical analyses using human pancreas data, CoMem-DIPHW effectively identified distinct cell clusters, as validated by the expression of known cell-type markers and differential expression analysis. Furthermore, the authors demonstrated that ignoring shared zeros in expression profiles, as done by cosine similarity, significantly impairs clustering performance, highlighting the importance of appropriately handling sparsity.
This work demonstrates the power of hypergraph representations for capturing the complex relationships in scRNA-seq data. By avoiding information loss from projections and addressing sparsity-induced correlation inflation, DIPHW and CoMem-DIPHW offer improved accuracy and robustness in cell clustering. The flexible simulation framework introduced also provides a valuable tool for benchmarking and exploring the impact of different data characteristics on clustering performance. Future work will focus on optimizing the memory mechanism and improving scalability for larger datasets.
Effect Size-Driven Pathway Meta-Analysis for Gene Expression Data by Juan Antonio Villatoro-García, Pablo Pedro Jurado-Bascón, Pedro Carmona-Sáez https://arxiv.org/abs/2501.13583
Traditional gene expression meta-analysis, while powerful, often encounters challenges such as missing genes and platform-specific discrepancies. These issues can lead to significant data loss and limit biological insights, especially when integrating data from different sources like RNA-seq and microarrays. Current methods often rely on imputing missing genes or combining p-values from pathway enrichment analyses, which can introduce bias or discard valuable information about effect directionality and magnitude.
To address these limitations, researchers have developed GSEMA (Gene Set Enrichment Meta-Analysis), a novel methodology that utilizes single-sample enrichment scoring. GSEMA transforms gene expression data into pathway-level matrices, applying meta-analysis techniques to enrichment scores rather than individual genes. This preserves the magnitude and directionality of effects, allowing for a deeper understanding of pathway activity across datasets. The method starts by calculating pathway activity scores for each sample using single-sample enrichment methods like ssGSEA, GSVA, Zscore, or singscore. These scores are then normalized, and pathways with low scores are filtered out. A moderated t-test is then employed to calculate effect sizes (Hedges' g) for each pathway in each study, with a bias correction applied to the variance of Hedges' g: V(g<sub>ij</sub>) = 1/n<sub>j</sub><sup>e</sup> + 1/n<sub>j</sub><sup>c</sup> + g<sub>ij</sub><sup>2</sup>/(2(n<sub>j</sub><sup>e</sup> + n<sub>j</sub><sup>c</sup>)), where n<sub>j</sub><sup>e</sup> and n<sub>j</sub><sup>c</sup> are the numbers of experimental and control samples in study j. Finally, a random-effects meta-analysis model integrates effect sizes across studies, and the resulting p-values are adjusted for multiple testing.
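For intuition, here is a compact sketch of the effect-size and pooling steps applied to one pathway's scores, using a standard DerSimonian-Laird random-effects estimator; it illustrates the general recipe rather than GSEMA's exact implementation, and the sample values are synthetic.

```python
import numpy as np

def hedges_g(case, ctrl):
    """Hedges' g and its variance for one pathway's enrichment scores."""
    ne, nc = len(case), len(ctrl)
    sp = np.sqrt(((ne - 1) * np.var(case, ddof=1) +
                  (nc - 1) * np.var(ctrl, ddof=1)) / (ne + nc - 2))
    j = 1.0 - 3.0 / (4.0 * (ne + nc) - 9.0)   # small-sample correction
    g = j * (np.mean(case) - np.mean(ctrl)) / sp
    v = 1.0 / ne + 1.0 / nc + g**2 / (2.0 * (ne + nc))
    return g, v

def random_effects(gs, vs):
    """DerSimonian-Laird pooled effect across studies."""
    gs, vs = np.asarray(gs), np.asarray(vs)
    w = 1.0 / vs
    q = np.sum(w * (gs - np.sum(w * gs) / w.sum())**2)
    c = w.sum() - np.sum(w**2) / w.sum()
    tau2 = max(0.0, (q - (len(gs) - 1)) / c)   # between-study variance
    w_star = 1.0 / (vs + tau2)
    return np.sum(w_star * gs) / w_star.sum()

# Toy example: one pathway measured in three studies.
rng = np.random.default_rng(1)
effects = [hedges_g(rng.normal(0.5, 1, 20), rng.normal(0, 1, 20))
           for _ in range(3)]
print(random_effects(*zip(*effects)))
```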
GSEMA's performance was evaluated using both simulated and real-world datasets. In simulations, GSEMA demonstrated superior control of false positive rates compared to existing methods like MAPE and traditional meta-analysis followed by gene set enrichment analysis (MA_GSA). While other methods generated over 50 false positives in most simulated scenarios, GSEMA methods produced far fewer. Specifically, under null scenarios the "Simulated_Pathway" was falsely flagged as significant in only 0-5% of GSEMA simulations, compared to over 45% for MAPE and MA_GSA. In real-world analyses of Systemic Lupus Erythematosus (SLE) and Parkinson's Disease (PD) datasets, GSEMA successfully identified biologically relevant pathways, including those related to immune response and interferon signaling in SLE. Importantly, GSEMA effectively handled datasets with missing genes and those generated from different platforms, demonstrating its ability to preserve information and improve data harmonization. In the SLE analysis with missing genes, GSEMA identified pathways related to immune and interferon responses, while traditional methods using only common genes failed to find these important associations. In the PD analysis, GSEMA identified novel up-regulated pathways, such as Neuroactive ligand-receptor interaction, which were missed by previous analyses.
GSEMA offers a robust and biologically informative approach to pathway meta-analysis of gene expression data. By focusing on pathway-level information and leveraging effect size-based meta-analysis, GSEMA overcomes limitations of traditional methods and provides a more comprehensive understanding of pathway activity across diverse datasets. While computationally intensive for large datasets, GSEMA's ability to integrate data from different platforms and handle missing genes makes it a valuable tool for transcriptomics research. The method is available as an R package on CRAN.
Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data by Fei Deng, Catherine H Feng, Nan Gao, Lanjing Zhang https://arxiv.org/abs/2501.14248
This study investigates how to improve machine learning (ML) model performance when analyzing transcriptomic data from different platforms, specifically microarray and RNA-seq. Cross-platform analysis is challenging due to inherent differences in data generation, which hinders direct comparisons. This research explores the hypothesis that incorporating non-differentially expressed genes (NDEGs) can enhance data normalization and, consequently, cross-platform ML modeling. The study utilizes TCGA breast cancer datasets, with microarray data serving as the training set and RNA-seq data as the independent test set (Model-S), and vice versa (Model-A), to classify molecular subtypes.
The methodology involves several key steps. Initially, data cleaning was performed to retain matched genes and samples with subtype labels. ANOVA was then used for gene selection, identifying NDEGs (p>0.85) for normalization and differentially expressed genes (DEGs, p<0.05) for classification. Several normalization methods were tested, including log-transformation (LOG), Z-score transformation (Z), non-parametric normalization (NPN), quantile normalization (QN), normal score transformation (NST), and combinations of these with NDEG-based normalization. Five common classifiers (SVM, RF, LASSO, MLP, XGB) were trained and evaluated using balanced accuracy, Kappa statistic, F1 score, AUC, sensitivity, and specificity. An E-value, combining the mean and variance of Kappa and balanced accuracy (E-value = -100 × (Kappa × Balanced Accuracy) × log(σ<sub>Kappa</sub> × σ<sub>Balanced Accuracy</sub>)), was used for model selection to account for robustness.
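As a concrete illustration of the selection criterion, a minimal sketch of the E-value computation follows; the cross-validation scores below are invented placeholders, and the use of the natural logarithm is an assumption about the base.

```python
import numpy as np

def e_value(kappas, balanced_accs):
    """Model-selection score rewarding high mean Kappa and balanced
    accuracy while penalizing their variability across repeats:
    E = -100 * (mean_kappa * mean_BA) * log(sd_kappa * sd_BA)."""
    k, ba = np.mean(kappas), np.mean(balanced_accs)
    sk, sba = np.std(kappas, ddof=1), np.std(balanced_accs, ddof=1)
    return -100.0 * (k * ba) * np.log(sk * sba)

# Hypothetical repeated-CV results for one classifier/normalization pair.
kappas = [0.71, 0.74, 0.73, 0.72]
bal_accs = [0.70, 0.72, 0.71, 0.71]
print(f"E-value: {e_value(kappas, bal_accs):.1f}")
```

Because the standard deviations are below one, the log term is negative, so more stable models (smaller spread) yield larger E-values, as do models with higher mean Kappa and balanced accuracy.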
The results show that NDEG-based normalization significantly improves classification performance, particularly when combined with LOG-QN or LOG-QN-Z normalization and MLP, LASSO, or SVM classifiers in Model-S. For example, MLP achieved a Kappa of over 0.83 and balanced accuracy of 0.700 with LOG-QN or LOG-QN-Z. Similar trends were observed in Model-A, where MLP exhibited the best and most stable performance (Kappa of 0.734 and balanced accuracy of 0.718). Interestingly, RF performed poorly in Model-A when using NDEGs. Generally, LOG and Z-score normalization performed poorly, while QN and NPN showed better results without NDEGs. The study underscores the importance of considering the interplay between gene selection, normalization, and classifier choice for optimal performance.
The discussion emphasizes the importance of a robust evaluation metric like the E-value, as relying solely on balanced accuracy or Kappa can be misleading with imbalanced datasets. The authors acknowledge that the high dimensionality of the data and potential technical noise may have inflated the number of selected DEGs, suggesting that future research should explore more sophisticated gene selection strategies and non-parametric statistical methods. The study also notes the limitations imposed by the small sample size and computational constraints, which prevented a more exhaustive exploration of parameter space.
This newsletter highlights a convergence of innovative computational approaches transforming biological data analysis. From predicting chromatin dynamics based on static 3D structures to leveraging hypergraphs for enhanced scRNA-seq clustering, these studies demonstrate the power of advanced computational tools to unravel complex biological phenomena. The development of GSEMA addresses critical challenges in gene expression meta-analysis, while the exploration of NDEG-based normalization offers promising improvements for cross-platform machine learning models. These advancements collectively contribute to a deeper understanding of genome organization, gene regulation, and cellular heterogeneity, paving the way for more accurate and insightful biological discoveries.