
This newsletter explores a collection of papers showcasing methodological advancements and applications in statistical modeling and data analysis. Several papers focus on causal inference and prediction. Jiang et al. (2024) introduce a framework for *longitudinal causal inference* with selective eligibility, where a unit's eligibility for subsequent treatments depends on its prior treatment history.

The development of novel statistical models for specific applications is another key theme. Rios and Xu (2024) propose a *Bayesian D-optimal experimental design*.

Spatial and network data analysis also feature prominently. MacDonald et al. (2024) introduce *mesoscale two-sample testing for network data*, and Eckardt et al. (2024) develop new tools for spatial data analysis.

Finally, privacy and data quality are addressed. Cho and Awan (2024) introduce *Semi-DP*, extending differential privacy to settings where invariant statistics of the confidential data are publicly released.

*Longitudinal Causal Inference with Selective Eligibility by Zhichao Jiang, Eli Ben-Michael, D. James Greiner, Ryan Halen, Kosuke Imai* https://arxiv.org/abs/2410.17864

*Caption: Estimated Average Treatment Effects of PSA Intervention*

Dropout in longitudinal studies poses a significant threat to the validity of causal inferences. While previous research has largely focused on missing outcomes due to treatment, this paper addresses a critical yet often overlooked source of dropout: *selective eligibility*. Selective eligibility occurs when a unit's eligibility for subsequent treatments is influenced by their prior treatment history. This is distinct from "truncation by death," as dropout occurs *after* observing the outcome but *before* the next treatment, rendering standard dropout approaches inapplicable.

This paper proposes a comprehensive methodological framework for longitudinal causal inference in the presence of selective eligibility. The authors introduce two novel causal estimands: the *average eligible treatment effect (ETE)* and the *expected number of outcome events (EOE)*, both defined for the subpopulation of units that remain eligible for treatment.

Under a generalized version of sequential ignorability, the authors derive two nonparametric identification formulas: one leverages outcome regression, the other inverse probability weighting. To enhance estimation efficiency, they derive the efficient influence function (EIF) for each estimand, leading to doubly robust estimators. These estimators remain consistent as long as either the propensity score model or the outcome-regression and eligibility models are correctly specified, extending the classic doubly robust estimator to longitudinal studies with selective eligibility.
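
To make the doubly robust idea concrete, here is a minimal single-period AIPW (augmented inverse probability weighting) sketch in Python. It is the classic cross-sectional building block, not the paper's longitudinal estimator with eligibility models, and the simulated data and names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated single-period data: covariate X, treatment T, outcome Y.
n = 5000
X = rng.normal(size=n)
p = 1 / (1 + np.exp(-X))              # true propensity score
T = rng.binomial(1, p)
Y = 2.0 * T + X + rng.normal(size=n)  # true average treatment effect = 2.0

def aipw_ate(Y, T, e_hat, mu1_hat, mu0_hat):
    """Augmented IPW estimate of the average treatment effect.

    Consistent if either the propensity model (e_hat) or the outcome
    models (mu1_hat, mu0_hat) is correctly specified.
    """
    return np.mean(
        mu1_hat - mu0_hat
        + T * (Y - mu1_hat) / e_hat
        - (1 - T) * (Y - mu0_hat) / (1 - e_hat)
    )

# Plug in the true nuisance functions for illustration.
ate = aipw_ate(Y, T, p, 2.0 + X, X)
print(round(ate, 2))  # close to the true effect of 2.0
```

In the paper's setting, the same augmentation idea is applied recursively across time periods, with additional models for the eligibility process.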

The practical utility of this framework is demonstrated through an application to a randomized controlled trial evaluating a pre-trial risk assessment instrument (PSA) in the criminal justice system. In this context, selective eligibility arises due to recidivism, as arrestees are only eligible for the PSA intervention upon rearrest. The analysis examines the PSA's influence on judicial decisions and subsequent negative outcomes (failure to appear, new criminal activity, and new violent criminal activity) for up to three arrests.

The findings reveal that providing the PSA generally increases agreement between judicial decisions and PSA recommendations, with statistically significant effects observed for the first two arrests. For the third arrest, the effect is significant only when the PSA was provided for the previous two arrests. Importantly, the analysis indicates minimal impact of the PSA on subsequent negative outcomes, aligning with prior analyses focused solely on first arrests.

*Formal Privacy Guarantees with Invariant Statistics by Young Hyun Cho, Jordan Awan* https://arxiv.org/abs/2410.17468

*Caption: Comparison of L2 costs between Semi-DP and naive mechanisms for two probability models.*

While differential privacy (DP) offers robust privacy protection for released query outputs, it faces challenges when certain statistics, known as invariants, are also publicly available. These invariants can leak information about the underlying data, potentially compromising individual privacy. Motivated by the 2020 US Census, which released both DP outputs and true statistics, this paper introduces Semi-Differential Privacy (Semi-DP), a novel framework that addresses the limitations of traditional DP in the presence of invariants.

Semi-DP refines the notion of adjacency in DP by restricting the scope to *invariant-conforming databases* – databases sharing the same invariant value as the confidential data. Within this restricted space, Semi-DP defines adjacency using a *semi-adjacent parameter, a(t)*, which quantifies the worst-case impact of replacing an individual's data while maintaining the invariant. Formally, *a(t) = sup<sub>i∈[n]</sub> sup<sub>x,y∈Di</sub> inf{d(X,Y) : X,Y ∈ D<sub>t</sub>, X<sub>i</sub> = x, Y<sub>i</sub> = y}*, where *D<sub>t</sub>* is the set of invariant-conforming databases and *d(X,Y)* is an adjacency metric. This ensures that even the most challenging data substitutions remain indistinguishable to an adversary.
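
As a toy illustration of this definition (our own example, not one from the paper), take binary databases of length *n* with the invariant "the entries sum to *t*" and Hamming distance as the adjacency metric; *a(t)* can then be computed by brute force:

```python
from itertools import product

def semi_adjacent_param(n, t):
    """Brute-force a(t) for binary databases of length n with the
    invariant 'entries sum to t', under Hamming adjacency."""
    D_t = [db for db in product([0, 1], repeat=n) if sum(db) == t]
    hamming = lambda X, Y: sum(a != b for a, b in zip(X, Y))
    worst = 0
    for i in range(n):                  # sup over positions i
        for x in (0, 1):                # sup over values x, y at position i
            for y in (0, 1):
                pairs = [(X, Y) for X in D_t for Y in D_t
                         if X[i] == x and Y[i] == y]
                if pairs:               # both values attainable under the invariant
                    worst = max(worst, min(hamming(X, Y) for X, Y in pairs))
    return worst

print(semi_adjacent_param(4, 2))  # 2
```

The value 2 captures the key phenomenon: under a sum invariant, changing one individual's entry forces a compensating change elsewhere, so "semi-adjacent" databases differ in two positions rather than one.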

The authors develop specialized mechanisms satisfying Semi-DP, including adaptations of the Gaussian mechanism and the optimal K-norm mechanism for rank-deficient sensitivity spaces. The optimal K-norm mechanism leverages the convex hull of the sensitivity space within the subspace spanned by the sensitivity space, minimizing noise addition while preserving privacy. The application of Semi-DP to contingency table analysis, relevant to the US Census, demonstrates how to release private outputs while preserving true marginal counts.
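
The marginal-preserving idea can be sketched as follows: encode the invariants as linear constraints and add Gaussian noise only within the null space of the constraint matrix, so the released table reproduces the true marginals exactly. This is a schematic with an arbitrary noise scale, not the paper's sensitivity-calibrated mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2x2 contingency table, flattened as (n11, n12, n21, n22).
table = np.array([30.0, 20.0, 25.0, 25.0])

# Invariants: both row totals and the first column total
# (the second column total is then implied).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])

# Orthonormal basis of the null space of A: the directions in which the
# table can move without changing any invariant.
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[np.linalg.matrix_rank(A):]

# Add Gaussian noise only inside the invariant-preserving subspace.
noise = null_basis.T @ rng.normal(scale=2.0, size=null_basis.shape[0])
released = table + noise

print(np.allclose(A @ released, A @ table))  # True: marginals preserved
```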

Numerical experiments showcase the superior performance of the Semi-DP Gaussian mechanism over naive approaches, consistently achieving lower L2 costs across various contingency table sizes and probability models. Similarly, the optimal K-norm mechanism under Semi-DP significantly reduces L2-costs compared to naive *l<sub>1</sub>*, *l<sub>2</sub>*, and *l<sub>∞</sub>*-norm mechanisms.

A crucial contribution of this work is the privacy analysis of the 2020 US Decennial Census using the Semi-DP framework. This analysis reveals that the effective privacy guarantees are weaker than advertised because the reported privacy parameters do not account for the released invariants. For instance, the Census mechanism satisfies *(D<sub>t</sub>, A<sub>2</sub>, 10.24)-zCDP* for state population totals, contrasting with the reported *(D, A<sub>1</sub>, 2.56)-zCDP*. Converting to (*ε, δ*)-DP with *δ = 10<sup>-10</sup>* yields an actual guarantee of *ε = 40.95057*, significantly higher than the advertised *ε = 17.91528*.
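
These numbers are consistent with the standard zCDP-to-DP conversion *ε = ρ + 2√(ρ ln(1/δ))* of Bun and Steinke (2016); note also that the fourfold increase in *ρ* (10.24 vs. 2.56) matches a doubling of sensitivity once the invariant is accounted for. The snippet below is our own sanity check on the reported figures, not code from the paper:

```python
import math

def zcdp_to_dp_epsilon(rho, delta):
    """rho-zCDP implies (eps, delta)-DP with
    eps = rho + 2 * sqrt(rho * ln(1 / delta))  (Bun & Steinke, 2016)."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

delta = 1e-10
print(round(zcdp_to_dp_epsilon(10.24, delta), 5))  # 40.95057 (effective)
print(round(zcdp_to_dp_epsilon(2.56, delta), 5))   # 17.91528 (advertised)
```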

The paper also acknowledges limitations of Semi-DP, particularly regarding individual-level privacy. The restricted adjacency notion can weaken privacy compared to traditional DP, especially with adversary side information. Future research should focus on mitigating these vulnerabilities, potentially by refining adjacency definitions or gaining deeper understanding of invariant structures.

*Saddlepoint Monte Carlo and its Application to Exact Ecological Inference by Théo Voldoire, Nicolas Chopin, Guillaume Rateau, Robin J. Ryder* https://arxiv.org/abs/2410.18243

*Caption: Comparison of Relative Standard Error in SPMC with and without Tilting*

Ecological inference (EI) involves analyzing aggregate data when the underlying individual-level data remain hidden. Traditional EI methods typically rely on approximations because exact inference is computationally demanding, especially with large datasets and complex models. This paper introduces saddlepoint Monte Carlo (SPMC), a novel method for obtaining unbiased, low-variance estimates of marginal likelihoods in such scenarios. The method performs importance sampling of the characteristic function, drawing on the saddlepoint approximation with exponential tilting, and is particularly well-suited to models in the exponential family.

The core of SPMC lies in the inversion formula *f<sub>AX</sub>(y) = (2π)<sup>-d<sub>y</sub></sup>∫<sub>[-π,π]<sup>d<sub>y</sub></sup></sub> exp(-iz<sup>T</sup>y)φ<sub>X</sub>(A<sup>T</sup>z)dz*, where *y = AX* is the *d<sub>y</sub>*-dimensional vector of observed aggregates, *A* is the known aggregation matrix, and *φ<sub>X</sub>* is the characteristic function of the latent individual-level data *X*. Rather than evaluating this integral by deterministic quadrature, SPMC estimates it by importance sampling, with a proposal distribution built from the exponentially tilted saddlepoint approximation to keep the variance of the estimate low.
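
A stripped-down, one-dimensional version of this estimator can be sketched without the saddlepoint tilting: estimate the pmf of a sum of independent Bernoulli variables (so *A* is a row of ones) by plain Monte Carlo on the inversion integral. Everything here is illustrative, and the paper's contribution is precisely the tilted proposal that makes such estimates low-variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# X has independent Bernoulli(p_j) entries; with A = (1, ..., 1),
# AX is their sum and f_AX(y) = P(sum(X) = y).
p = np.array([0.2, 0.5, 0.7, 0.4])
y = 2

# Plain Monte Carlo on f_AX(y) = (2*pi)^(-1) * integral over [-pi, pi] of
# exp(-i z y) * phi_X(z, ..., z) dz, with z drawn uniformly; the uniform
# density 1/(2*pi) cancels the prefactor, leaving a plain sample mean.
m = 200_000
z = rng.uniform(-np.pi, np.pi, size=m)
phi = np.prod(1 - p + p * np.exp(1j * z[:, None]), axis=1)  # phi_X(A^T z)
est = (np.exp(-1j * z * y) * phi).mean().real

print(round(est, 3))  # close to the exact value P(sum = 2) = 0.4
```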

The power of SPMC is demonstrated through its application to French election data, a classic EI problem. Analyzing the 2007 presidential election, the authors show that approximating the multinomial distribution of votes with a Gaussian, a common practice in some EI studies, leads to substantial bias. SPMC, in contrast, allows for exact inference, revealing, for instance, significant discrepancies between estimated abstention rates and those reported in exit polls (80% vs. 64%). Further analysis of the 2022 presidential election, incorporating population density as a covariate, demonstrates SPMC's ability to handle complex models. The results indicate a positive correlation between population density and the probability of voting for Macron (center) after initially supporting Mélenchon (left/far-left). Finally, the analysis of the 2024 legislative election data showcases SPMC's scalability, efficiently handling constituencies with varying numbers of candidates.

SPMC represents a significant advance in EI, enabling exact inference where previous methods relied on approximations. Its low variance and scalability make it suitable for large datasets and complex models, opening new research avenues in electoral sociology and other fields dealing with aggregate data. Beyond EI, potential applications include data privacy and inverse problems, highlighting SPMC's broad utility in computational statistics.

*Towards more realistic climate model outputs: A multivariate bias correction based on zero-inflated vine copulas by Henri Funk, Ralf Ludwig, Helmut Kuechenhoff, Thomas Nagler* https://arxiv.org/abs/2410.15931

*Caption: The image illustrates the three-step process of Vine Copula Bias Correction (VBC) for climate model data. First, vine copula modeling estimates dependencies, accounting for zero-inflated margins. Then, a Rosenblatt transformation corrects the model data to a uniform domain, considering discrete-continuous mixtures. Finally, delta mapping projects the corrected data to the reference distribution using multiplicative and additive projections.*

Climate models, essential for understanding climate variability and projecting future scenarios, are prone to biases arising from incomplete representations of physical processes. These biases can significantly distort the multivariate distributional shape of climate variables, impacting downstream applications such as hydrological modeling and extreme event analysis. Existing bias correction methods, like univariate quantile mapping (UBC) and multivariate bias correction (MBCn), struggle to address the zero-inflation often present in high-resolution climate data, especially for variables like precipitation and radiation. This necessitates a more sophisticated approach that accurately captures the complexities of zero-inflated climate data.

This paper introduces *Vine Copula Bias Correction for partially zero-inflated margins (VBC)*, a novel multivariate bias correction methodology. This technique leverages the flexibility of vine copulas, renowned for their ability to model high-dimensional, nonlinear dependencies, and extends their application to accommodate zero-inflated variables. A key theoretical contribution of VBC is a generalized vine density decomposition formula, extending the work of Bedford and Cooke (2001, 2002), which allows for the inclusion of variables with mixed discrete and continuous components. VBC corrects model data by transforming it to a uniform domain using a modified Rosenblatt transformation that accounts for zero-inflation, and then projects it to the reference distribution using a modified delta mapping procedure.
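
For orientation, the univariate building block that UBC applies per variable, and that VBC's Rosenblatt-plus-delta-mapping pipeline generalizes to the multivariate, zero-inflated setting, is empirical quantile mapping. The sketch below uses simulated data and omits zero-inflation handling entirely:

```python
import numpy as np

def quantile_map(model, reference):
    """Empirical quantile mapping: send each model value to the reference
    quantile with the same non-exceedance probability."""
    ranks = np.searchsorted(np.sort(model), model, side="right") / len(model)
    return np.quantile(reference, ranks)

rng = np.random.default_rng(3)
model = rng.normal(loc=2.0, scale=1.5, size=10_000)      # biased model output
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)  # observations

corrected = quantile_map(model, reference)
# corrected now follows the reference distribution (mean ~0, sd ~1)
print(round(corrected.mean(), 1), round(corrected.std(), 1))
```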

The performance of VBC was evaluated against UBC and MBCn in a real-world application to five climate variables from the CRCM5-LE dataset over three Bavarian catchments. Results demonstrate VBC's consistent superiority in distributional similarity to the reference data, as measured by the 2nd Wasserstein distance (*W²*), across the variables and catchments considered.

The authors also introduce the *Model Correction Inconsistency (MCI)* metric, which assesses how well the weather patterns in the model data are preserved after correction. The MCI measures the absolute difference in non-exceedance probabilities between the model data and its bias-corrected counterpart. Results show that while UBC is, by design, the least invasive method, VBC outperforms MBCn in preserving the temporal course of the weather, exhibiting lower average MCI values and less seasonal variation in inconsistency. This suggests that VBC corrects biases while retaining the model's inherent temporal dynamics.
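
Under one natural reading of this definition (our assumption; the paper's exact formula may differ, e.g. in how probabilities are pooled over time), the MCI can be computed from empirical non-exceedance probabilities:

```python
import numpy as np

def mci(model, corrected):
    """Mean absolute difference between each observation's empirical
    non-exceedance probability in the raw model series and in its
    bias-corrected counterpart."""
    def nonexceedance(x):
        return np.searchsorted(np.sort(x), x, side="right") / len(x)
    return np.mean(np.abs(nonexceedance(model) - nonexceedance(corrected)))

rng = np.random.default_rng(4)
model = rng.gamma(2.0, 2.0, size=5000)

# A monotone correction preserves every rank, so the inconsistency is zero.
print(mci(model, 1.3 * model + 0.5))                 # 0.0

# A rank-scrambling "correction" is not rank-preserving.
print(round(mci(model, rng.permutation(model)), 2))  # close to 1/3
```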

This newsletter has highlighted significant advances in statistical methodology and their application across diverse domains. From tackling selective eligibility in longitudinal causal inference to developing novel privacy-preserving techniques and enhancing the realism of climate model outputs, the papers discussed showcase the power of innovative statistical thinking. The introduction of the ETE and EOE estimands by Jiang et al. offers a robust framework for analyzing complex longitudinal studies, while the Semi-DP framework by Cho and Awan addresses critical limitations of traditional differential privacy in the presence of invariant statistics. The development of SPMC by Voldoire et al. enables exact inference in challenging ecological inference problems, and the VBC method by Funk et al. provides a powerful tool for correcting biases in high-resolution climate data. These contributions collectively push the boundaries of statistical modeling and data analysis, offering valuable insights and tools for researchers across various fields.