This collection of preprints explores methodological advances across diverse domains, including ophthalmology, manufacturing, causal inference, and time series analysis. Sielinski (2025) critically assesses the Belin/Ambrósio Deviation (BAD) model for keratoconus detection, revealing flaws in the Total Deviation Value caused by systematic bias, multicollinearity, and inconsistencies in the normative dataset. In manufacturing, Megahed et al. (2025) investigate the adaptability of OpenAI's CLIP model for few-shot image inspection, demonstrating its efficacy in single-component and texture-based applications while noting limitations in complex multi-component scenes. Feldman and Reiter (2025) introduce outcome-assisted multiple imputation for missing treatments in observational studies, advocating the inclusion of both covariates and outcomes to improve the accuracy of treatment effect estimation.
Several papers address methodological challenges in clinical trials and epidemiological studies. Xu et al. (2025) propose four methods to control Type I error in two-step hybrid clinical trials that incorporate real-world data, offering improvements over existing approaches such as the Bayesian power prior. Fabre Ferber et al. (2025) explore interpolation techniques for data augmentation in geo-referenced data, demonstrating the benefits of Gaussian processes and kriging for predicting weed presence in sugarcane plots. For social media analysis, Meagher and Friel (2025) employ Hawkes processes to model online discussion dynamics, capturing phenomena such as superspreading and circadian rhythms. Pais et al. (2025) introduce a pseudolikelihood-based Mixture of Unigrams (MoU) model for topic modeling of free-response text data from complex surveys, addressing the challenges posed by informative sampling.
Genomic analysis is the focus of two preprints. Villatoro-García et al. (2025) present GSEMA, a novel gene set enrichment meta-analysis method that leverages single-sample enrichment scoring, improving biological interpretation while controlling false positive rates. Zhang et al. (2025) introduce dBiRS, a distributed version of the binary and re-search algorithm for efficient whole-genome signal region detection that accommodates both quantitative and binary traits. Two further preprints address sequential testing and generalizability. Fields et al. (2025) propose a family of one-sided SPRT-type tests for sequentially testing whether a stochastic process is generated by a known Markov chain, with sample sizes determined adaptively by data-driven estimators. Breza et al. (2025) develop a framework for discovering generalizable archetypes in policy interventions, emphasizing the importance of acknowledging ignorance and eliciting further evidence when generalizability is uncertain.
Several studies apply statistical methods to specific real-world problems. Drew et al. (2025) introduce a Bayesian record linkage model for spatial location data, applied to estimating tree growth from overlapping LiDAR scans. Pelella et al. (2025) propose a car-following model that incorporates behavioral adaptation to road geometry, improving the accuracy of free-flow and car-following dynamics simulations. Rajapaksha et al. (2025) develop a Bayesian learning model for joint risk prediction of alcohol and cannabis use disorders, demonstrating improved predictive accuracy over univariate models. Leach et al. (2025) assess changes in the 100-year return value of climate model variables, finding stronger evidence for changes in solar irradiance and temperature extremes than in wind variables. Finally, Fonseka et al. (2025) introduce PoPStat, a novel metric correlating population pyramid deviations with disease mortality, offering insights into demographic determinants of health.
Computationally Efficient Whole-Genome Signal Region Detection for Quantitative and Binary Traits by Wei Zhang, Fan Wang, Fang Yao https://arxiv.org/abs/2501.13366
Caption: The figure displays the probability of association with a trait across genomic positions for three different signal region detection methods: dBiRS, Q-SCAN, and KnockoffScreen. dBiRS demonstrates superior performance by identifying distinct peaks exceeding the significance threshold (dotted red line) with greater precision and fewer false positives compared to the other methods, aligning with its higher detection rate and lower false discovery rate observed in simulations. The red arrows indicate the true locations of simulated signals.
The identification of genetic signal regions is crucial for understanding the genetic basis of complex traits and diseases. This paper introduces dBiRS (distributed Binary and Re-search), a novel algorithm that significantly advances whole-genome signal region detection. dBiRS addresses limitations of existing scan-based methods and offers superior power and computational efficiency, particularly for whole-genome sequencing (WGS) studies. It accommodates both binary and continuous traits, making it a versatile tool for genetic research.
The dBiRS algorithm leverages a distributed computing framework, enabling parallel processing of large genomic datasets. It employs a two-stage approach: first, the genome is divided into blocks, each analyzed locally using the BiRS algorithm; then, detected signal regions and their statistics are aggregated on a central machine for final evaluation. This distributed approach significantly enhances computational efficiency, especially for WGS data with millions of variants.
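To make the two-stage structure concrete, here is a minimal Python sketch of the distributed scheme. The local scan is a crude thresholding stand-in for the actual BiRS algorithm, and the block size and thresholds are illustrative assumptions rather than the authors' settings.

```python
# Minimal sketch of the two-stage distributed scheme described above.
# `local_birs` is a stand-in for the real BiRS step; block size and
# thresholds are illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def local_birs(block):
    """Stage 1 (per worker): scan one genomic block and return candidate
    regions as (start, end, max_abs_score) triples. The 'score' here is a
    toy per-variant z-statistic; BiRS would use binary segmentation and
    re-search instead of this simple thresholding."""
    start, scores = block
    hits = np.where(np.abs(scores) > 2.0)[0]          # crude local screen
    if hits.size == 0:
        return []
    return [(start + hits[0], start + hits[-1], float(np.abs(scores).max()))]

def dbirs(scores, block_size=1000, genomewide_threshold=4.0):
    """Stage 2 (central machine): gather candidate regions from all blocks
    and keep those whose maximal statistic survives a genome-wide threshold
    (which the paper calibrates via a multiplier bootstrap)."""
    blocks = [(i, scores[i:i + block_size])
              for i in range(0, len(scores), block_size)]
    with ProcessPoolExecutor() as pool:
        candidates = [r for regions in pool.map(local_birs, blocks)
                      for r in regions]
    return [(s, e) for s, e, stat in candidates if stat > genomewide_threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal(10_000)
    z[4_200:4_240] += 5.0                              # planted signal region
    print(dbirs(z))
```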
A key innovation of dBiRS is its use of a novel infinity-norm test statistic based on summary statistics from a generalized linear model (GLM): g(μ) = Xγ + Gβ, where μ is the expected outcome, g is the link function, X contains the covariates, and G the genotypes. This GLM framework allows dBiRS to adjust for covariates and handle both continuous and binary outcomes. A second layer of BiRS on the central machine, using a multiplier bootstrap, reassesses the signal regions within significant blocks, ensuring control of the family-wise error rate (FWER) and false discovery rate (FDR).
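A multiplier bootstrap for an infinity-norm statistic can be sketched as follows; the score-contribution matrix and Gaussian multipliers are our assumptions for illustration, not the paper's exact construction.

```python
# Hedged sketch of multiplier-bootstrap calibration for an infinity-norm
# statistic. U is an n x p matrix of per-subject score contributions from
# the GLM (e.g., G_ij * (y_i - mu_hat_i)); the details are assumptions,
# not the paper's exact construction.
import numpy as np

def sup_norm_threshold(U, n_boot=1000, alpha=0.05, seed=0):
    """Approximate the (1 - alpha) quantile of max_j |sum_i e_i U_ij| / sqrt(n)
    under the null by redrawing i.i.d. N(0, 1) multipliers e_i."""
    rng = np.random.default_rng(seed)
    n = U.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        e = rng.standard_normal(n)                 # Gaussian multipliers
        stats[b] = np.abs(e @ U).max() / np.sqrt(n)
    return np.quantile(stats, 1 - alpha)

# Observed statistic: infinity norm of the standardized score vector.
rng = np.random.default_rng(1)
U = rng.standard_normal((500, 200))                # toy score contributions
T_obs = np.abs(U.sum(axis=0)).max() / np.sqrt(500)
print(T_obs, sup_norm_threshold(U))
```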
Simulations demonstrate dBiRS's superior performance compared to existing methods like Q-SCAN and KnockoffScreen. dBiRS maintains accurate FWER control while achieving higher detection rates and lower FDRs across various signal strengths and distance parameters. Applied to UK Biobank WES data, dBiRS identified known and novel associations with fluid intelligence and prospective memory, validating its practical utility. The discovery of rare variants near newly implicated genes provides valuable insights into the genetic basis of cognitive traits and neurodegenerative disorders. dBiRS represents a powerful and scalable tool for whole-genome signal region detection, promising to accelerate genetic research and improve our understanding of complex traits.
Adapting OpenAI's CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples by Fadel M. Megahed, Ying-Ju Chen, Bianca Maria Colosimo, Marco Luigi Giuseppe Grasso, L. Allison Jones-Farmer, Sven Knoth, Hongyue Sun, Inez Zwetsloot https://arxiv.org/abs/2501.12596
Caption: The image illustrates the core concept of few-shot learning with CLIP. The left side depicts how image embeddings (I<sub>1</sub>...I<sub>N</sub>) are generated from a small set of labeled images and compared against text embeddings (T<sub>1</sub>...T<sub>N</sub>) derived from class descriptions (e.g., "dog," "car"). The right side demonstrates the classification process where a new image's embedding is compared to the learned embeddings for classification based on cosine similarity.
This paper explores a simplified approach to image-based quality control using OpenAI's CLIP model, adapted for few-shot learning. This method addresses the challenges of applying powerful computer vision models like CLIP to industrial settings, where large labeled datasets are often unavailable. By leveraging CLIP's ability to learn from limited examples, the researchers demonstrate a practical and efficient method for quality inspection across various manufacturing scenarios.
The core methodology involves generating embeddings from a small set of labeled nominal and defective images using CLIP's visual encoder. New images are then classified based on their cosine similarity to these learned embeddings. This approach bypasses the need for extensive training data and complex model fine-tuning, making it readily accessible to quality engineers. The cosine similarity is calculated as s(x, y) = cosSim(f<sub>θ</sub>(x), g<sub>Φ</sub>(y)) = (f<sub>θ</sub>(x) ⋅ g<sub>Φ</sub>(y)) / (||f<sub>θ</sub>(x)|| ||g<sub>Φ</sub>(y)||), where f<sub>θ</sub>(x) is the embedding of the test image and g<sub>Φ</sub>(y) is the embedding of each learning-set image.
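The classification rule reduces to a few lines of numpy once embeddings are in hand. In the sketch below the embeddings are random stand-ins (real ones would come from CLIP's visual encoder), and averaging the learning-set embeddings per class is one simple variant of the comparison described above.

```python
# Minimal sketch of the few-shot classification rule on top of precomputed
# CLIP image embeddings. The random vectors are stand-ins so the snippet is
# self-contained; in practice they would come from CLIP's visual encoder.
import numpy as np

def classify(test_emb, class_embs):
    """Assign the class whose mean learning-set embedding has the highest
    cosine similarity to the test image embedding."""
    def unit(v):
        return v / np.linalg.norm(v)
    test_u = unit(test_emb)
    scores = {label: float(unit(embs.mean(axis=0)) @ test_u)
              for label, embs in class_embs.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
class_embs = {                        # 10 labeled examples per class, 512-d
    "nominal": rng.standard_normal((10, 512)),
    "defective": rng.standard_normal((10, 512)) + 0.1,
}
label, scores = classify(rng.standard_normal(512), class_embs)
print(label, scores)
```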
Across five diverse case studies, the research demonstrates that CLIP's few-shot learning can achieve high accuracy with limited data in specific applications. For instance, metallic pan surface inspection achieved 94% accuracy with only 10 examples per class. Similarly, stochastic textured surface evaluation achieved 97% accuracy with 50 examples, showcasing CLIP's ability to capture subtle variations. However, performance degraded in complex multi-component scenes like automotive assembly inspection, highlighting the limitations of this simplified approach in such scenarios.
Outcome-Assisted Multiple Imputation of Missing Treatments by Joseph Feldman, Jerome P. Reiter https://arxiv.org/abs/2501.12471
Caption: Empirical Coverage Rates of Imputation Methods for Missing Treatment Data
This paper introduces Outcome-assisted Multiple Imputation of Treatments (OMIT), a novel method for handling missing treatment data in observational studies. Missing treatment data poses a significant challenge for causal inference, as traditional approaches like complete-case analysis or simple imputation can lead to biased estimates. OMIT addresses this issue by incorporating both covariates and outcomes into the imputation model, leading to more accurate imputations and better causal effect estimation.
The key innovation of OMIT lies in its use of the outcome variable to sharpen the predictive probabilities for missing treatments. Instead of relying solely on propensity scores (treatment assignment given covariates), OMIT combines a propensity score model, p(t = 1 | x), with an outcome model, f(y | x, t), to create a more informative imputation model: p(t = 1 | y, x) ∝ p(t = 1 | x) f(y | x, t = 1). This approach leverages the relationship between treatment, covariates, and outcome to improve the accuracy of the imputed treatments.
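The imputation probability follows directly from Bayes' rule, as the short sketch below illustrates; the logistic propensity and Gaussian outcome models are convenience assumptions, not the paper's required specification.

```python
# Illustrative sketch of the outcome-assisted imputation probability
# p(t = 1 | y, x) ∝ p(t = 1 | x) f(y | x, t = 1). The logistic propensity
# and Gaussian outcome models below are assumptions chosen for simplicity.
import numpy as np
from scipy.stats import norm

def omit_probability(x, y, propensity, outcome_mean, sigma=1.0):
    """Posterior probability that the missing treatment equals 1."""
    p1 = propensity(x)                                       # p(t = 1 | x)
    lik1 = norm.pdf(y, loc=outcome_mean(x, 1), scale=sigma)  # f(y | x, 1)
    lik0 = norm.pdf(y, loc=outcome_mean(x, 0), scale=sigma)  # f(y | x, 0)
    return p1 * lik1 / (p1 * lik1 + (1 - p1) * lik0)

# Toy models: treatment raises the outcome mean by 2.
propensity = lambda x: 1.0 / (1.0 + np.exp(-0.5 * x))
outcome_mean = lambda x, t: x + 2.0 * t

rng = np.random.default_rng(0)
x, y = 0.3, 2.6                       # a unit with missing treatment
p = omit_probability(x, y, propensity, outcome_mean)
imputed_t = rng.binomial(1, p)        # one multiple-imputation draw
print(round(p, 3), imputed_t)
```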
The authors provide theoretical justification for OMIT by deriving an expression for the bias of the inverse probability weighted (IPW) estimator under multiple imputation of missing treatments. They show that OMIT can minimize this bias, especially when the outcome model is correctly specified. Simulations across various scenarios demonstrate that OMIT consistently outperforms naive imputation and complete-case analysis, achieving higher coverage rates and lower mean squared errors. Importantly, OMIT exhibits robustness even with misspecified outcome models, highlighting the benefits of incorporating outcome information.
Generalizability with ignorance in mind: learning what we do (not) know for archetypes discovery by Emily Breza, Arun G. Chandrasekhar, Davide Viviano https://arxiv.org/abs/2501.13355
Caption: Comparison of Treatment Effect Prediction Methods
This paper introduces a novel framework for assessing generalizability in treatment effect studies. Recognizing that treatment effects may not be universally transferable, the authors propose a method to identify "generalizable archetypes" – subgroups where treatment effects are consistent and predictive for others within the group. Crucially, the framework also acknowledges the "basin of ignorance" – contexts where data is insufficient for reliable predictions, highlighting areas requiring further research.
The methodology involves a two-step process. First, unbiased estimates of the conditional average treatment effect and its variance are obtained for small groups defined by observable characteristics. Second, each group is assigned to either an archetype or the basin of ignorance by optimizing an objective function. This function balances the goals of maximizing predictions (claiming generalizability) while minimizing prediction errors, acknowledging a fixed cost for admitting ignorance and needing further data collection.
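A stylized version of the assignment step, for the single-archetype case, might look like the sketch below; the plug-in risk and the cost threshold c are our notation for illustration, not the paper's objective function.

```python
# Stylized sketch of the assignment step: a group joins the archetype when
# its estimated squared prediction error from the archetype mean is below a
# fixed cost c of admitting ignorance; otherwise it falls into the basin of
# ignorance. Single-archetype case; notation is ours, not the paper's.
import numpy as np

def assign(tau_hat, var_hat, c):
    """tau_hat, var_hat: group-level CATE estimates and their variances."""
    archetype_mean = np.average(tau_hat, weights=1.0 / var_hat)
    # Simple plug-in risk of predicting each group by the archetype mean.
    risk = (tau_hat - archetype_mean) ** 2 + var_hat
    return np.where(risk < c, "archetype", "ignorance"), archetype_mean

tau_hat = np.array([0.10, 0.12, 0.09, 0.55])   # last group looks different
var_hat = np.full(4, 0.01)
labels, mean = assign(tau_hat, var_hat, c=0.05)
print(labels, round(mean, 3))
```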
Applied to a multi-faceted anti-poverty program across six countries, the method reveals heterogeneous treatment effects and identifies a basin of ignorance for the wealthiest households, where traditional pooled analyses yield ambiguous results. Simulations further demonstrate the method's superiority over existing alternatives, achieving significant reductions in prediction error. This framework offers a more nuanced approach to understanding heterogeneity and generalizability, with important implications for both research and policy.
This newsletter highlights a diverse range of methodological advancements in statistics and machine learning. From genomic analysis to manufacturing quality control and causal inference, these preprints offer innovative solutions to complex challenges. The development of dBiRS provides a powerful tool for whole-genome signal region detection, while the adaptation of CLIP for few-shot learning simplifies image-based quality inspection. The introduction of OMIT and the framework for generalizable archetypes address critical issues in causal inference, emphasizing the importance of accurate imputation and acknowledging the limits of generalizability. These advancements collectively contribute to a more robust and nuanced understanding of data across various domains, paving the way for more informed decision-making in research and practice.