This collection of preprints explores the diverse applications of statistical and machine learning methodologies across various domains. A key focus is enhancing predictive models through innovative feature engineering and model selection strategies. Buhamra & Groll (2025) investigate statistically enhanced covariates, derived from separate statistical models, to improve tennis match prediction, mirroring successful applications in football. Similarly, Royal et al. (2025) propose a hybrid statistical model, combining Generalized Additive Models (GAMs) with Seasonal Autoregressive Integrated Moving Average (SARIMA) models, for long-term electric load forecasting in district energy systems. For analyzing mutational signatures, Hansen et al. (2025) introduce a Bayesian Multi-Study Non-negative Matrix Factorization (NMF) approach, incorporating sparsity in exposure weights and enabling covariate-dependent signature identification. These studies highlight the importance of tailoring statistical and machine learning approaches to specific data characteristics and domain knowledge for improved predictive performance.
Another prominent theme is the development of methods for quantifying and interpreting complex relationships within data. Faes et al. (2025) utilize predictive information decomposition and vector autoregressive (VAR) models to quantify emergent dynamical behaviors in physiological networks. Sechidis et al. (2025) leverage individual treatment effects, estimated using a Double Robust (DR) learner, to assess treatment effect heterogeneity in clinical trials. Chen et al. (2025) develop a two-stage model for inferring trip purposes and socio-economic attributes of public transit users, combining rule-based and XGBoost models. These contributions highlight the growing interest in moving beyond simple predictive modeling towards a deeper understanding of underlying processes and mechanisms.
Several papers address methodological challenges in specific statistical frameworks. Zhu & Li (2025) propose a flexible Bayesian tensor decomposition for verbal autopsy data, balancing predictive accuracy and interpretability. Song et al. (2025) introduce the Poisson Process AutoDecoder (PPAD), a neural field decoder for analyzing X-ray source data, directly capturing the Poisson nature of photon arrival times. Rigon et al. (2025) develop Bayesian nonparametric methods for biodiversity estimation, using Gibbs-type priors and accommodating Linnean taxonomy. Odiathevar & Yup (2025) present a statistical approach for simulating realistic network traffic, enabling more robust testing of network monitoring tools. Vicentini & Jermyn (2025) investigate prior selection for Dirichlet Process Mixtures, proposing a sample-size-independent methodology.
Further contributions include Valachovic's (2025) Extended Kolmogorov-Zurbenko (EKZ) filter for time series analysis, Azze et al.'s (2024) semi-Markov modulated Brownian bridge model of wind farm storage, and Dong et al.'s (2024) method for constructing simultaneous confidence bands for errors-in-variables curves.
Finally, several papers deliver applied tools and evaluation frameworks. Qiu et al. (2025) propose an estimator-robust design for augmenting RCTs with external real-world data using adaptive targeted maximum likelihood estimation (A-TMLE). Neher et al. (2025) present a Bayesian integrative mixed modeling framework for analyzing the ABCD Study. Jørgensen et al. (2025) introduce NABQR, a Python package for improving probabilistic forecasts. Ye & Wedel (2025) develop SIGN, a gaze network for gaze time prediction. Ackerman et al. (2025) focus on multi-metric evaluation of LLMs.
Improving LLM Leaderboards with Psychometrical Methodology by Denis Federiakin https://arxiv.org/abs/2501.17200
The rapid advancement of Large Language Models (LLMs) necessitates robust evaluation methods. Current LLM leaderboards often rely on simplistic aggregation methods, like averaging benchmark scores, which may not accurately capture the complex interplay of LLM abilities. This paper advocates for using established psychometric methodologies to refine LLM evaluation and ranking. It highlights the crucial distinction between test development (focused on validity) and benchmark development (focused on representativism), which has significant implications for measurement design and modeling.
The paper explores applying Confirmatory Factor Analysis (CFA) to LLM performance data from the Hugging Face Leaderboard. Because the number of benchmark questions far exceeds the number of LLMs, the authors employ parceling, aggregating homogeneous items into single variables for the CFA. This addresses the limitations of simple averaging by accounting for varying item difficulty and sensitivity to the latent variable being measured. The core formula, Uₘₚ = logit⁻¹(μₚ + λₚθₘ + ωₘₚ), where Uₘₚ is the performance of model m on parcel p, allows for estimating factor scores (θₘ) that represent a refined measure of LLM ability.
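As a concrete illustration, here is a minimal sketch of fitting this one-factor logistic model by least squares, assuming parcel-level accuracies in (0, 1); all names are illustrative, and the paper's actual CFA (with residual covariances and proper standard errors) would rely on dedicated SEM software rather than this toy optimizer.

```python
# Sketch: least-squares fit of U_mp ≈ logit^-1(mu_p + lambda_p * theta_m).
# Not the paper's estimator; it ignores the residual term omega_mp.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logit^-1

def fit_one_factor(U):
    """U: (n_models, n_parcels) matrix of parcel-level accuracies in (0, 1)."""
    n_models, n_parcels = U.shape

    def unpack(x):
        mu, lam = x[:n_parcels], x[n_parcels:2 * n_parcels]
        theta = x[2 * n_parcels:]
        theta = (theta - theta.mean()) / theta.std()  # identification constraint
        return mu, lam, theta

    def loss(x):
        mu, lam, theta = unpack(x)
        pred = expit(mu[None, :] + theta[:, None] * lam[None, :])
        return ((U - pred) ** 2).sum()

    x0 = np.concatenate([np.full(2 * n_parcels, 0.1),
                         np.random.default_rng(0).normal(size=n_models)])
    res = minimize(loss, x0, method="L-BFGS-B")
    return unpack(res.x)  # mu_p, lambda_p, and factor scores theta_m
```

The returned theta_m plays the role of the refined ability estimate that the paper contrasts with raw benchmark averages.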
The analysis revealed a single general factor underlying performance across benchmarks, suggesting a general intelligence factor (g-factor) for LLMs. However, residual dependencies related to task content suggest specialized abilities also contribute to performance. Comparing average scores with estimated factor scores revealed an inverted U-shaped relationship, indicating that raw averages may overestimate weaker models and underestimate stronger ones. The normalized data provided more precise ability estimates. The analysis also highlighted the diminishing returns of LLM scaling, a trend better captured by factor scores.
The paper suggests investigating benchmarks for items that disproportionately penalize higher-ability models and proposes using longitudinal psychometric models with anchor items to compare across leaderboard versions. Analyzing residual correlations could also reveal the structure of LLMs' world models. Limitations include vague construct definitions in benchmarks and the challenge of obtaining a representative LLM sample.
An Estimator-Robust Design for Augmenting Randomized Controlled Trial with External Real-World Data by Sky Qiu, Jens Tarp, Andrew Mertens, Mark van der Laan https://arxiv.org/abs/2501.17835
Caption: Impact of RWD quality and sample size on the number of selected external patients using the proposed matching strategy versus random sampling.
Augmenting randomized controlled trials (RCTs) with real-world data (RWD) can enhance statistical power, but combining these data sources presents challenges due to confounding and inconsistencies. This paper introduces a novel, estimator-robust design strategy for integrating RWD with RCTs using Adaptive Targeted Maximum Likelihood Estimation (A-TMLE).
The strategy involves a two-step matching process. First, RCT participants are matched with external patients based on the trial enrollment score, P(S=1|W). This balances covariate distributions between the RCT and RWD. Second, within the matched external cohort, treated patients are matched with control patients based on the propensity score in the external data, P(A=1|S=0, W). This addresses confounding within the RWD. This matching aims to make both the trial enrollment score and the external propensity score approximately constant with respect to W, improving the robustness of A-TMLE. The pooled-ATE estimand, Ψ(P₀) = E₀[E₀(Y|W, A=1) − E₀(Y|W, A=0)], and the bias estimand, Ψ#(P₀) = E₀[Π₀(0|W,0)τₛ,₀(W,0) − Π₀(0|W,1)τₛ,₀(W,1)], are central to this approach.
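A hedged sketch of how such a two-step design might look in code, using off-the-shelf logistic regression and greedy 1:1 nearest-neighbor matching (a simplification; the paper's exact matching procedure and the downstream A-TMLE estimation are not reproduced here):

```python
# Sketch of the two-step matching: enrollment-score matching, then
# propensity-score matching within the selected external cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def two_step_match(W_rct, W_ext, A_ext):
    # Step 1: trial enrollment score P(S=1|W), fit on the pooled sample.
    W = np.vstack([W_rct, W_ext])
    S = np.concatenate([np.ones(len(W_rct)), np.zeros(len(W_ext))])
    enroll = LogisticRegression(max_iter=1000).fit(W, S)
    e_rct = enroll.predict_proba(W_rct)[:, 1]
    e_ext = enroll.predict_proba(W_ext)[:, 1]

    # Match each RCT participant to the nearest external patient on the score.
    nn = NearestNeighbors(n_neighbors=1).fit(e_ext.reshape(-1, 1))
    _, idx = nn.kneighbors(e_rct.reshape(-1, 1))
    matched = np.unique(idx.ravel())

    # Step 2: external propensity score P(A=1|S=0,W) within the matched cohort,
    # matching treated external patients to external controls.
    Wm, Am = W_ext[matched], A_ext[matched]
    prop = LogisticRegression(max_iter=1000).fit(Wm, Am)
    p = prop.predict_proba(Wm)[:, 1].reshape(-1, 1)
    treated, control = np.where(Am == 1)[0], np.where(Am == 0)[0]
    nn2 = NearestNeighbors(n_neighbors=1).fit(p[control])
    _, cidx = nn2.kneighbors(p[treated])
    keep = np.union1d(treated, control[np.unique(cidx.ravel())])
    return matched[keep]  # indices into W_ext of the selected external patients
```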
Simulations comparing the matching strategy to random sampling demonstrated narrower confidence intervals, nominal coverage, and improved power with the matching strategy, especially under misspecified bias models. A case study augmenting the DEVOTE trial with claims data showed that the matching strategy reduced discrepancies between RWD and RCT estimates. The A-TMLE estimator on the matched and pooled data produced statistically significant results with narrower confidence intervals compared to the RCT-only analysis and another estimator (ES-CVTMLE). This highlights the practical value of combining a robust matching design with a robust estimator like A-TMLE.
Detecting clinician implicit biases in diagnoses using proximal causal inference by Kara Liu, Russ Altman, Vasilis Syrgkanis https://arxiv.org/abs/2501.16399
Caption: This causal diagram illustrates the relationships between patient attributes (D), health state mediator (M), health proxies (Z, X), medical outcome (Y), and sociodemographic confounders (W) to analyze implicit bias (θ) in medical diagnoses. The method assesses the causal effect of patient attributes on medical outcomes, decomposing it into biological effects (mediated through M) and implicit bias effects (independent of M). Unobserved health states are addressed using observed medical data (Z, X) as proxies.
Implicit bias in healthcare can lead to unequal outcomes. Existing methods for measuring implicit bias often rely on subjective assessments and may not reflect real-world behavior. This research proposes a causal inference method to detect implicit bias effects on patient outcomes using large-scale observational data.
The method analyzes the causal effect of a patient's sociodemographic attributes on medical diagnoses, decomposing this effect into two pathways: the biological effect (mediated through valid biological traits) and the implicit bias effect (independent of the patient's true health state). Since the true health state is usually unobserved, researchers use observed medical data as proxies and employ proximal causal inference. A new proximal mediation method estimates the implicit bias effect, defined as the controlled direct effect:
θ = ∫ E[Y(1,m) − Y(0,m) | W=w] p(m,w) dm dw
where Y(d,m) is the potential outcome with interventions on the attribute (D) and mediator (M), and W represents confounders. A non-zero θ suggests implicit bias. The method assumes a partially linear relationship between variables and uses linear instrumental variable regression for estimation. Tests are introduced to assess the validity of assumptions.
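To make the estimation strategy concrete, here is a rough two-stage least squares sketch of the partially linear setup, with the outcome proxies X standing in for the unobserved health state and the proxies Z serving as instruments; the variable roles and the plain-OLS stages are assumptions for illustration, not the authors' estimator.

```python
# Sketch: 2SLS under a partially linear model. The coefficient on the
# attribute D approximates theta, the implicit-bias (controlled direct) effect.
import numpy as np

def proximal_2sls(Y, D, X, Z, W):
    """Y: (n,) outcome; D: (n,) attribute; X: (n, px) outcome proxies;
    Z: (n, pz) proxies used as instruments; W: (n, pw) confounders."""
    n = len(Y)
    ones = np.ones((n, 1))
    # Stage 1: project the endogenous proxies X onto the instruments (Z, D, W).
    inst = np.hstack([ones, Z, D.reshape(-1, 1), W])
    Xhat = inst @ np.linalg.lstsq(inst, X, rcond=None)[0]
    # Stage 2: regress Y on (1, D, Xhat, W).
    design = np.hstack([ones, D.reshape(-1, 1), Xhat, W])
    beta = np.linalg.lstsq(design, Y, rcond=None)[0]
    return beta[1]  # estimated theta; nonzero suggests implicit bias
```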
Validation using semi-synthetic and UK Biobank data showed promising results. In semi-synthetic experiments, the method accurately retrieved known implicit bias effects. In the UK Biobank data, initial results using all proxies were invalid due to assumption violations. However, after applying a proxy selection algorithm, significant implicit bias effects were identified in several attribute-diagnosis pairs. For example, a negative bias was detected for female patients diagnosed with heart disease, while a positive bias was found for Black patients diagnosed with chronic kidney disease.
Limitations include the static nature of the UK Biobank data and the assumption of partially linear relationships. Future work will focus on validating the method with time-series EHR data and exploring non-linear relationships. The method aims to provide insights into systemic biases for informing interventions and training programs, rather than targeting individual clinicians.
Rethinking the Win Ratio: A Causal Framework for Hierarchical Outcome Analysis by Mathieu Even, Julie Josse https://arxiv.org/abs/2501.16933
Caption: Comparison of Average Estimated Treatment Effect in a Randomized Controlled Trial with Heterogeneous Outcomes
Evaluating treatments with complex, multivariate outcomes is challenging. Existing methods like the Win Ratio, while used in practice, lack a strong causal foundation. This paper establishes causal foundations for hierarchical comparison methods, revealing potential issues and proposing a more robust approach.
The problem lies in how patient pairs are formed for comparison. Existing methods often use complete pairings, which target a population-level causal estimand and can yield misleading recommendations in heterogeneous populations. The authors instead introduce an individual-level, identifiable causal estimand, τ = E[w(Yᵢ(1), Yⱼ(0)) | Xᵢ = Xⱼ], where w is the hierarchical win comparison: a treated patient is compared to a hypothetical control patient with identical features. Using Nearest Neighbor pairings provides a consistent estimator for τ in RCTs, approximating this ideal individual-level estimand.
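A minimal sketch of the nearest-neighbor pairing idea in an RCT, assuming multivariate outcomes compared coordinate-by-coordinate with a hierarchical win function w taking values in {−1, 0, +1} (the win function and all names are illustrative):

```python
# Sketch: nearest-neighbor win estimator. Each treated patient is paired
# with the control whose covariates are closest, and win scores are averaged.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def win_score(y1, y0, tol=0.0):
    """Hierarchical comparison: decide on the first coordinate that differs."""
    for a, b in zip(y1, y0):
        if a > b + tol:
            return 1.0   # win for treatment
        if a < b - tol:
            return -1.0  # loss for treatment
    return 0.0           # tie on all coordinates

def nn_win_estimator(X_treat, Y_treat, X_ctrl, Y_ctrl):
    """X_*: (n, d) covariates; Y_*: (n, k) hierarchical outcomes."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_ctrl)
    _, idx = nn.kneighbors(X_treat)
    return np.mean([win_score(Y_treat[i], Y_ctrl[j])
                    for i, (j,) in enumerate(idx)])
```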
The framework extends to observational studies using propensity weighting and distributional regression to address dimensionality and missing covariates. A doubly robust estimator is also introduced. Simulations demonstrated superior performance of the proposed methods, especially in higher dimensions and under model misspecification, highlighting the robustness of the distributional regression approach. This work provides a significant advance in the causal analysis of hierarchical outcomes.
A Bayesian Integrative Mixed Modeling Framework for Analysis of the Adolescent Brain and Cognitive Development Study by Aidan Neher, Apostolos Stamenos, Mark Fiecas, Sandra Safo, Thierry Chekouo https://arxiv.org/abs/2501.17705
Caption: Comparison of BIPmixed with other methods across three simulated scenarios, demonstrating superior predictive performance through closer alignment of predicted and true outcome values.
Integrating high-dimensional, heterogeneous data from multi-site cohort studies with nested hierarchical structures presents significant challenges. This paper introduces the Bayesian Integrative Mixed Modeling (BIPmixed) framework, extending the existing BIP framework by incorporating nested random effects for simultaneous feature selection and outcome modeling in hierarchical data.
BIPmixed utilizes a multi-view learning approach, decomposing each data view X⁽ᵐ⁾ into shared and view-specific components: X⁽ᵐ⁾ = UA⁽ᵐ⁾ + E⁽ᵐ⁾. The outcome is treated as an additional view, adjusted for covariates and nested random effects θ. Binary indicators facilitate feature and component selection.
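The shared-factor decomposition can be sketched with a simple alternating least squares loop; note that BIPmixed itself is Bayesian, with spike-and-slab selection indicators and nested random effects that this illustration omits.

```python
# Sketch: alternating least squares for X^(m) = U A^(m) + E^(m), where the
# score matrix U is shared across views and A^(m) is view-specific.
import numpy as np

def shared_factors(views, r=5, iters=100, seed=0):
    """views: list of (n, p_m) arrays sharing rows; r: number of factors."""
    n = views[0].shape[0]
    U = np.random.default_rng(seed).normal(size=(n, r))
    for _ in range(iters):
        # Update each view's loadings A^(m) given the shared scores U.
        A = [np.linalg.lstsq(U, X, rcond=None)[0] for X in views]
        # Update U given all loadings, stacking views column-wise.
        Xall, Aall = np.hstack(views), np.hstack(A)
        U = np.linalg.lstsq(Aall.T, Xall.T, rcond=None)[0].T
    return U, A
```

Treating the outcome vector as one of the views, as the framework does, would let the shared scores U drive both feature selection and prediction.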
Simulations and application to the ABCD Study demonstrated BIPmixed's superior predictive accuracy compared to other methods, particularly in scenarios with nested random effects. In the ABCD Study, BIPmixed identified relevant imaging and early life adversity features associated with externalizing problems. Incorporating covariates directly into the outcome model further improved prediction performance.
Limitations include treating the outcome as just another view and the reliance on exchangeable correlation structures. Further research is needed to address these limitations and expand BIPmixed's applicability.
This newsletter highlights advancements in statistical and machine learning methodologies across diverse domains. From improving LLM leaderboard rankings using psychometrics to detecting implicit bias in medical diagnoses through causal inference, the papers showcase the increasing sophistication of analytical tools. The development of robust methods for integrating real-world data with RCTs, along with novel approaches for analyzing hierarchical outcomes and complex multi-site studies like the ABCD Study, underscores the ongoing effort to refine statistical techniques for real-world problems. The common thread is a shift from purely predictive modeling toward understanding underlying mechanisms, supported by more robust and interpretable methodologies.