Statistics

New submissions

Submissions received from Wed 6 May 26 to Thu 7 May 26, announced Fri, 8 May 26

[ total of 164 entries: 1-164 ]

New submissions for Fri, 8 May 26

[1] arXiv:2605.05255: Title: Prediction of Drought and Flash Drought in Africa at the Seasonal-to-Subseasonal Scale using the Community Research Earth Digital Intelligence Twin Framework

Authors: Stuart Edris, Amy McGovern, Jason Hickey

Subjects: Applications (stat.AP); Atmospheric and Oceanic Physics (physics.ao-ph)

Droughts and flash droughts (rapidly developing droughts; FDs) remain impactful events that are known to desiccate landscapes and destroy crops. In particular, droughts in Africa are often more impactful than in other locations, such as the United States or Europe, because many regions in Africa depend heavily on local agriculture for sustenance. In recent years, large machine learning (ML) models, such as GraphCast and AIFS, have emerged as effective tools for global weather prediction. However, sparse observations and few ML studies in Africa have left it unclear whether these ML models retain their skill when focused on Africa. As such, this project seeks to examine the predictability of drought and FD in Africa using a CrossFormer model based on the Community Research Earth Digital Intelligence Twin (CREDIT) framework developed by NSF NCAR. Our CrossFormer model, termed DroughtFormer, incorporates variables from the ERA5 and GLDAS2 reanalyses and the IMERG and MODIS satellite observations, and employs dry air mass and moisture conservation, to predict soil moisture, vegetation health, and other drought-related surface variables. While DroughtFormer displayed lower accuracy in predicting precipitation and FD indices, it showed significant skill in predicting the remaining variables, delivering stable and skillful forecasts out to 90-day lead times (either beating or matching the skill of climatology). In particular, DroughtFormer skillfully represented climate anomalies for key variables, such as soil moisture (though it struggled with the magnitude of the anomalies). Thus, DroughtFormer showed significant promise in representing and predicting agricultural-level drought in a region that is heavily impacted by drought events.
[2] arXiv:2605.05262: Title: Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

Authors: Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

Comments: Preprint, 9 pages, 5 figures

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bounded away from zero for hard prompts regardless of the budget. Motivated by this, we recast intermediate state selection as a monotone submodular maximization problem, where a greedy one-step selector enjoys a $(1 - 1/e)$ approximation guarantee.
Our Uncertainty-aware Upper Confidence Bound (UUCB) terms arise as closed-form marginal gains of this objective. This turns the token-level entropy bonus from an empirical trick into an analytic consequence of the formulation. We present InfoTree, a training-time tree-search framework coupling UUCB with a learned Adaptive Budget Allocator (ABA) and an asynchronous Speculative Expansion scheme.
ABA rescues prompts whose initial tree is wasted on uniform outcomes, lifting the mixed-outcome ratio from 58.1 percent to 76.3 percent with less than 5 percent budget overhead. Speculative Expansion reduces wall-clock overhead from 14.3 percent to 4.8 percent by tolerating bounded staleness in UUCB scores.
Across nine benchmarks spanning math reasoning (AIME 2024 and 2025, MATH-500, OlympiadBench, USAMO), web-search agents (GAIA, HLE-100, BrowseComp-lite), and tool-rich coding and OS agents (APPS-verified, AgentBench-OS), InfoTree outperforms flat GRPO, DeepSearch, Tree-GRPO, AT2PO, CW-GRPO, and RC-GRPO. Head-to-head compositions with Tree-GRPO prefix sharing and CW-GRPO contribution weights deliver further gains, confirming that our selector operates orthogonally to rollout reuse and trajectory re-weighting. A 5 by 5 by 5 robustness grid reveals that over three quarters of the hyperparameter space lies on a performance plateau, confirming UUCB robustness.
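As a generic illustration of the greedy selection idea mentioned in the abstract above (not the authors' InfoTree or UUCB implementation), the following sketch runs greedy maximization of a set-coverage function, a standard monotone submodular objective, under a cardinality budget. The function name `greedy_max_coverage` and the toy candidate sets are hypothetical and chosen only for illustration.

```python
def greedy_max_coverage(candidates, budget):
    """Greedily pick up to `budget` candidate sets to maximize covered elements.

    Set coverage is monotone submodular, so this greedy rule attains at least
    a (1 - 1/e) fraction of the best coverage achievable with the same budget.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(budget):
        # Marginal gain of a candidate = number of newly covered elements.
        best = max(remaining, key=lambda k: len(remaining[k] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds new elements
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered


# Hypothetical toy data: each key is a candidate, each value the items it covers.
toy = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
picked, covered = greedy_max_coverage(toy, budget=2)
```

In this toy run the first pick is the candidate with the largest set, and the second pick is whichever candidate adds the most elements not yet covered, which is the marginal-gain rule that the approximation guarantee relies on.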
[3] arXiv:2605.05270: Title: Forecasting Oncology Demand Trends with Boosting-Based Bayesian Conjugate Models

Authors: Ademir Batista dos Santos Neto, Tiago Alessandro Espinola Ferreira, Paulo Renato Alves Firmino

Comments: 18 pages, 3 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)

Accurate trend forecasting in healthcare time series is essential for planning and resource allocation. This paper proposes a Bayesian framework for predicting oncology demand trends, modeling weekly appointments as a Poisson process with a Gamma prior on the demand rate. To enhance adaptability and capture persistent directional patterns, we incorporate a residual-based boosting mechanism grounded in a Gamma-Log-Normal conjugate structure. This boosting approach allows the model to track both short- and long-term trend shifts while maintaining the analytical tractability of conjugate Bayesian updating. The methodology was evaluated on real oncology service data from Cariri, Ceará, Brazil, and compared against established baselines, including linear regression, ARIMA, naive forecasting, LSTM neural networks, and XGBoost. Results showed that the proposed model outperforms competing methods in trend detection accuracy, with gains in the percentage of correctly predicted directions of 38.25% over the second-best approach in some cases.
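The base layer described above, Poisson counts with a Gamma prior on the rate, has a closed-form posterior. The sketch below shows only that standard Gamma-Poisson update under a shape/rate parameterization; it does not attempt the paper's residual-based boosting step, whose details the abstract does not give. The prior values and weekly counts are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Return the Gamma posterior (shape, rate) after observing Poisson counts.

    With a Gamma(alpha, beta) prior (shape, rate) on the Poisson rate and n
    observed counts y_1..y_n, the posterior is Gamma(alpha + sum(y), beta + n).
    """
    return alpha + sum(counts), beta + len(counts)


prior_alpha, prior_beta = 2.0, 1.0   # hypothetical prior: mean rate of 2 per week
weekly_counts = [12, 15, 9, 14]      # hypothetical weekly appointment counts

post_alpha, post_beta = gamma_poisson_update(prior_alpha, prior_beta, weekly_counts)
posterior_mean = post_alpha / post_beta  # posterior expected weekly demand
```

Because the update only adds the total count to the shape and the number of weeks to the rate, it can be applied week by week as new data arrive.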
[4] arXiv:2605.05359: Title: Bayesian inference of sparsity in stable vector autoregressive processes

Authors: Sarah E. Heaps, Ian H. Jermyn, Yujiang Wang, Darren J. Wilkinson

Subjects: Methodology (stat.ME)

Advances in sensing technology have made it possible to collect large volumes of high-dimensional time-series data. In fields like genetics and neuroscience, key questions concern whether directed relationships between variables can be learned from these data. To this end, graphical vector autoregressions are a popular tool because zeros among the autoregressive coefficients and error precision matrix have natural interpretations in terms of Granger non-causality and contemporaneous conditional independence. In applications where system dynamics are subject to functional or structural constraints, assuming the process is stable can be advantageous. However, enforcing stability demands restricting the autoregressive coefficients to lie in a constrained space with a complex geometry called the stationary region. The resulting inferential challenges are compounded when sparsity is also a requirement. Working in the Bayesian paradigm, we tackle the problem of developing a prior that simultaneously enforces stationarity and sparsity through parameter expansion, constructing a spike-and-slab prior with support constrained to the stationary region. A mixture of G-Wishart distributions provides a sparse prior for the error precision matrix. Computational inference is carried out using Metropolis-within-Gibbs, exploiting the No-U-Turn Sampler and reversible-jump steps. We demonstrate the inferential and predictive benefits of our approach through simulations and applications in macroeconomics and neuroscience.
[5] arXiv:2605.05371: Title: Multilevel Regression Modeling of Covariance Matrix Outcomes

Authors: Michelle Murphy Green, Xi Luo, Brian S. Caffo, Yi Zhao

Subjects: Methodology (stat.ME)

Covariance matrix outcomes arise naturally in neuroimaging experiments to study brain functional connectivity. It is also of interest to understand how brain network organization varies with subject-level covariates. Existing covariance regression methods operate in a single-level framework and do not accommodate the hierarchically nested data structure in which subjects are grouped into clusters, such as age cohorts in lifespan studies. A Multilevel Covariate-Assisted Principal Regression (MCAP) framework is introduced, which identifies, for each cluster, a linear projection such that a generalized linear mixed effects model can be formulated with the covariates. The cluster-specific projections are modeled on the unit sphere via a von Mises-Fisher distribution, enabling principled borrowing of information across clusters. Model parameters are estimated by maximizing a hierarchical likelihood. For inference, a two-stage bootstrap procedure is proposed. Asymptotic properties of the estimators are established. Simulation studies demonstrate that MCAP substantially outperforms single-level competitors in estimating regression coefficients. Applied to the Human Connectome Project Lifespan Study spanning ages from five to ninety, MCAP identifies a dominant spectral brain network capturing age and sex effects on functional connectivity, and reveals findings including the convergence of neural reorganization patterns in late adulthood and the coordinated lifespan modulation of cross-network regions linked to language and executive function.
[6] arXiv:2605.05384: Title: Improving Minority Population Sampling with BISG Probabilities: Evidence from a Survey of Jewish Americans

Authors: Kyla Chasalow, Eitan Hersh, Kosuke Imai, Laura Royden

Subjects: Applications (stat.AP); Methodology (stat.ME)

Sampling geographically dispersed minority populations poses substantial challenges when individual group membership cannot be directly observed. Although stratified sampling can offer efficiency gains, these gains are typically modest unless the minority population is highly concentrated within a small number of strata. In this paper, we propose using Bayesian Improved Surname Geocoding (BISG) to enhance the efficiency of minority population sampling. BISG generates individual-level probabilities of minority group membership based on names and residential addresses. We incorporate these probabilities into a stratified Poisson probability sampling design. Applying the proposed approach to a national survey of Jewish Americans, we find that our estimates closely align with those from a large-scale Pew Research Center survey of the same population, which relied on a substantially more expensive sampling strategy involving geographic stratification and screening. At a fraction of the cost, our survey reproduces nearly identical patterns observed by Pew, including estimates of religious denominations and participation in specific religious activities.
[7] arXiv:2605.05396: Title: Bayesian Region Selection and Prediction in Poisson Regression with Spatially Dependent Global-Local Shrinkage Prior

Authors: Zihan Zhu, Xueying Tang, Shuang Zhou

Comments: 24 pages, 7 figures

Subjects: Methodology (stat.ME); Applications (stat.AP)

High-dimensional spatially correlated covariates are common in regression models encountered in environmental sciences and other fields. In such models, the regression coefficients often exhibit a sparse structure with spatial dependence. Although standard variable selection approaches can help detect the sparse structure, incorporating the dependence into variable selection helps recover spatially contiguous signals and improves prediction accuracy. Motivated by a real-world challenge in hurricane count prediction, we propose a novel neighborhood-structured global-local shrinkage prior for prediction and region selection in Poisson regression with spatial covariates. The proposed prior combines the Conditional Auto-Regressive (CAR) prior with a Super Heavy-tailed prior to introduce spatial dependence among the coefficients while ensuring appropriate shrinkage effects for covariate selection. We develop an efficient Metropolis-within-Gibbs sampler for computation that accommodates the count data. Extensive simulation studies demonstrate that the proposed model excels when signals are weak and adjacent and the spatial dependence in covariates is strong. In an application to hurricane prediction in the North Atlantic, our method outperforms traditional regression-based approaches and rivals the benchmark oracle model.
[8] arXiv:2605.05399: Title: Causal Effect Estimation on Restricted Mean Survival Time in Case-Cohort Studies via a Matching Design

Authors: Andy Ni, Wei-En Lu, Bo Lu

Subjects: Methodology (stat.ME)

In large observational studies, the case-cohort design is commonly used to reduce the cost associated with covariate measurement. For survival outcomes, the literature has suggested that the restricted mean survival time (RMST) is a more appropriate marginal causal effect measure than the hazard ratio. In this paper, we develop a marginal causal effect estimation method for the RMST difference under the stratified case-cohort design. We adjust for measured confounders using an innovative template matching design. Compared with conventional matching designs, template matching allows greater flexibility in the sample sizes of the exposed and unexposed groups. We establish the asymptotic properties of the proposed causal effect estimators and develop a bootstrap procedure to estimate their variances. Through comprehensive simulation studies, we evaluate the finite-sample performance of the proposed estimators, demonstrate the advantage of template matching over conventional matching, and compare matching on the propensity score with matching on covariates. Finally, we apply the proposed methods to the Atherosclerosis Risk in Communities (ARIC) Study to estimate the marginal causal effect of serum hs-CRP level on coronary heart disease-free survival.
[9] arXiv:2605.05428: Title: Parameter estimation for kappa distributions using the EM algorithm in the superstatistical framework

Authors: Leonardo Sebastian Herrera, Sergio Davis

Subjects: Methodology (stat.ME); Statistical Mechanics (cond-mat.stat-mech); Plasma Physics (physics.plasm-ph)

Kappa distributions are widely used in space plasma physics to model velocity distribution functions with extended tails. However, parameter estimation in these distributions presents a fundamental challenge: the kappa distribution does not belong to the exponential family, which prevents the direct application of analytically tractable maximum likelihood methods. In this work we propose a solution to this problem based on data augmentation: we introduce the inverse temperature $\beta$ as a gamma-distributed latent variable, thereby recovering the exponential family structure in the complete-data likelihood. This enables an implementation of the expectation-maximization (EM) algorithm in analytically closed form, with E-step and M-step derived from sufficient statistics. Our approach is agnostic with respect to the underlying physical mechanism generating fluctuations of $\beta$, which is a central aspect of the Beck-Cohen superstatistics framework, allowing a statistically rigorous treatment without compromising physical interpretability. We demonstrate that the method converges to the usual maximum likelihood estimators by applying it to synthetic data. Our results suggest that EM offers a computationally efficient and conceptually clear alternative for inference in superstatistical systems with temperature fluctuations.
[10] arXiv:2605.05432: Title: Direct Estimation of Schrödinger Bridge Time-Series Drifts: Finite-Sample, Asymptotic, and Adaptive Guarantees

Authors: Othmane Mazhar, Huyên Pham

Comments: 36 pages, 3 figures, 8 tables

Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

We study nonparametric estimation of Schrödinger bridge (SB) drifts from i.i.d. data observed on a single time interval. Starting from the conditional-ratio form of the Schrödinger bridge time-series (SBTS) drift formula, we analyze a direct Nadaraya-Watson plug-in estimator built from kernelized numerator and denominator terms. Unlike recent SB analyses based on entropic-OT potentials, Sinkhorn iterations, or iterative bridge solvers, our approach works directly at the drift level and isolates statistical error from optimization, approximation, and discretization error.
Under Hölder regularity, a marginal-density floor, and bounded support, we prove a uniform non-asymptotic bound for admissible bandwidth pairs, a pointwise CLT under genuine undersmoothing, and an adaptive bandwidth selector satisfying an oracle inequality. We also prove a pivot-local minimax lower bound which, through an explicit uniform pivot, yields a global minimax lower bound under transparent compatibility conditions; hence the adaptive selector is minimax-rate optimal up to logarithmic factors. Synthetic experiments provide theorem-targeted diagnostics for finite-sample scaling, Gaussian approximation, and adaptive behavior.
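For readers unfamiliar with the estimator class named above, the sketch below shows a plain one-dimensional Nadaraya-Watson regression estimate as a ratio of a kernelized numerator and denominator. It is a generic textbook form, assuming a Gaussian kernel and a single shared bandwidth; it is not the SBTS drift estimator itself, which the abstract describes as using bandwidth pairs and a specific conditional-ratio formula. The function name and toy data are hypothetical.

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel-weighted average of ys at the query point x0 (Gaussian kernel)."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    numerator = sum(w * y for w, y in zip(weights, ys))  # kernelized numerator
    denominator = sum(weights)                           # kernelized denominator
    if denominator == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return numerator / denominator


# Hypothetical toy data lying on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
estimate_at_one = nadaraya_watson(1.0, xs, ys, bandwidth=0.5)
```

The explicit zero-denominator check loosely mirrors why a density floor appears among the abstract's assumptions: the ratio is unstable wherever the denominator is close to zero.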
[11] arXiv:2605.05436: Title: Estimating Implicit Regularization in Deep Learning

Authors: Joseph H. Rudoler, Kevin Tan, Giles Hooker, Konrad P. Kording

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Deep learning systems are known to exhibit implicit regularization (also called implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g., early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.
[12] arXiv:2605.05446: Title: Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation

Authors: Chengyu Cui, Gongjun Xu

Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)

Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a benign regularizer that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.
[13] arXiv:2605.05458: Title: Model Form Identification in High-Dimensional Functional Linear Regressions

Authors: Xingche Guo, Yehua Li, Pang Du

Subjects: Methodology (stat.ME)

High-dimensional functional data are becoming increasingly common in fields such as environmental monitoring and neuroimaging. This paper studies high-dimensional functional linear regression models that relate a scalar response to ultra-high-dimensional functional predictors, where each predictor is treated as a random element in an infinite-dimensional functional space. To address the dual challenges of high-dimensionality and model interpretability, we propose MoFI-FLR, a novel two-step estimation framework rooted in reproducing kernel Hilbert space (RKHS) theory. The first step employs a functional elastic-net penalty to screen out irrelevant covariates, while the second step decomposes each selected predictor's functional coefficient into an interpretable finite-dimensional simple component and an infinite-dimensional complementary component. By penalizing only the complementary component, our method automatically distinguishes simple effects, which consist only of the simple component, from complex effects, which also include complementary deviations. Under mild regularity conditions, we establish non-asymptotic theoretical guarantees, demonstrating that MoFI-FLR consistently recovers the active covariates and accurately identifies their true functional forms. We develop a computationally efficient algorithm to implement the proposed method and evaluate its performance through comprehensive simulation studies and an application to Psychomotor Vigilance Task EEG data.
[14] arXiv:2605.05493: Title: A renormalization-group inspired lattice-based framework for piecewise generalized linear models

Authors: Joshua C. Chang

Comments: Under review

Subjects: Methodology (stat.ME); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Statistics Theory (math.ST)

We formally introduce a class of models inspired by renormalization group (RG) theory, built on additive hierarchical expansions analogous to those appearing in functional ANOVA and mixed-effects models. Like ReLU convolutional neural networks, they are almost everywhere locally linear; unlike ReLU networks, their partition structure is explicit, interpretable, and easy to modify or constrain. In these models, one defines a multidimensional lattice partition of the input space and uses it to scaffold variations in regression parameters. Each dimension of the lattice corresponds to an attribute by which the statistics of the problem may vary. The parameters are themselves expressed in the form of an expansion, where each term captures variations relative to a lower (coarser) interaction scale. These models admit multiple equivalent interpretations: as piecewise GLMs, as hierarchical mixed-effects regressions, or as regression trees with structured parameter sharing. Since RG motivates the design of these models, we use techniques from statistical physics -- specifically replica analysis -- to study their generalization properties. Specifically, we analyze the behavior of the Watanabe-Akaike Information Criterion (WAIC) as a proxy for generalization loss. This analysis yields two practical results: (i) guidance on the lattice design as a function of dataset size and predictor dimensionality; and (ii) a principled scaling law for the regularization prior when adding higher-order terms to the expansion so that one can increase model complexity without an expected increase in generalization loss. We evaluate the methodology on public datasets and find performance competitive against both blackbox methods and other intrinsically interpretable approaches.
[15] arXiv:2605.05523: Title: Permutation-preserving Functions and Neural Vecchia Covariance Kernels

Authors: Jian Cao, Nian Liu, Ying Lin

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

We introduce a novel framework for constructing scalable and flexible covariance kernels for Gaussian processes (GPs) by directly learning the covariance structure under a regression-type parameterization induced by Vecchia approximations, using deep neural architectures. Specifically, we model kriging coefficients and conditional standard deviations, deterministic quantities that uniquely characterize the covariance, providing stable and informative learning targets. Exploiting the permutation-equivariant structure of conditioning sets in the Vecchia factorization, we derive a universal representation for permutation-preserving functions and design neural architectures that respect this symmetry, leading to improved training stability and data efficiency. The proposed approach enables expressive, non-stationary kernel learning while maintaining computational scalability, thereby bridging classical GP methodology with modern deep learning.
[16] arXiv:2605.05528: Title: Spectral Collapsed Gibbs Sampler for Bayesian Sparse Regression

Authors: Andrew Chin, Xiyu Ding, Akihiko Nishimura

Subjects: Methodology (stat.ME); Computation (stat.CO)

Sparse regression models based on global-local shrinkage priors are increasingly used for Bayesian modeling of modern high-dimensional data, but scaling up the Gibbs sampler for posterior inference remains a challenge. While much effort has gone into speeding up the high-dimensional coefficient update step, insufficient attention has been given to the potentially poor mixing of the global scale parameter $\tau$ and of the overall sampler. One proposed remedy has been to marginalize out the coefficients when updating $\tau$. Here we show that, while this collapsed update was previously thought to require a Metropolis step, we can in fact sample directly and efficiently from the collapsed density. This is made possible by careful linear algebraic manipulations and a strategic per-Gibbs-scan spectral decomposition, allowing subsequent evaluations of the collapsed density across hundreds of values of $\tau$ at negligible cost. We combine this computational trick with adaptive numerical integration and inverse transform sampling to construct a direct sampler. This eliminates the need to tune Metropolis proposals and yields faster convergence and improved mixing. We demonstrate our method on two big data applications, fitting logistic regression under the horseshoe prior to datasets with design matrices of size 120,000 x 1,379 and 1,980 x 17,848.
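The final step mentioned above, inverse transform sampling from a density that can be evaluated cheaply at many points, can be sketched generically as follows. This is a simplified stand-in, assuming a fixed grid, trapezoidal integration, and linear interpolation of the CDF, whereas the abstract refers to adaptive numerical integration; the spectral trick that makes the density evaluations cheap is not shown. The function name and inputs are hypothetical.

```python
import bisect
import math


def sample_from_grid(grid, log_dens, u):
    """Map a uniform draw u in [0, 1] to a value on `grid` by inverting an
    approximate CDF built from unnormalized log-density values."""
    peak = max(log_dens)
    dens = [math.exp(v - peak) for v in log_dens]  # rescale to avoid overflow
    cdf = [0.0]
    for i in range(1, len(grid)):
        # Trapezoid rule for the mass between consecutive grid points.
        cdf.append(cdf[-1] + 0.5 * (dens[i] + dens[i - 1]) * (grid[i] - grid[i - 1]))
    target = u * cdf[-1]                   # normalization handled implicitly
    i = bisect.bisect_left(cdf, target)    # first grid index with cdf >= target
    i = min(max(i, 1), len(grid) - 1)
    span = cdf[i] - cdf[i - 1]
    frac = (target - cdf[i - 1]) / span if span > 0 else 0.0
    return grid[i - 1] + frac * (grid[i] - grid[i - 1])


# Hypothetical check with a flat density on [0, 2]: the quantile function is 2u.
flat_draw = sample_from_grid([0.0, 1.0, 2.0], [0.0, 0.0, 0.0], u=0.25)
```

In a sampler, `u` would come from a uniform random number generator; passing it in explicitly keeps the mapping deterministic and easy to check.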
[17] arXiv:2605.05539: Title: Welcome to the Statverse: A Metaverse for Data Science

Authors: Ronny Vallejos, Miguel de Carvalho, Roberto Cruz, Nicolás Iribarra, José Allende, Edmundo Casas, Francisco Marshall, Sebastián Suárez, Leopoldo Cárdenas, Ozan Evkaya

Comments: 11 pages, 5 figures

Subjects: Other Statistics (stat.OT); Computation (stat.CO); Methodology (stat.ME)

This paper introduces the Statverse, a Metaverse framework designed to revolutionize statistical education in the digital age. Our key goal is to report our progress and encourage others to integrate similar strategies into their programs. The proposed framework seamlessly integrates the physical and digital realms to provide an immersive environment for the nuanced representation of complex statistical concepts. Finally, we discuss the potential impact of Statverse on advancing Statistical Education, offering a transformative approach to teaching and learning in the digital age. Statverse is the outcome of an academic partnership between Universidad Técnica Federico Santa María (UTFSM) and the University of Edinburgh (UoE).
[18] arXiv:2605.05562: Title: Socio-Conformal Calibration in Complex Survey Data: Marginal Validity Is Not Enough for Subgroup Reliability

Authors: Amir Rafe, Subasish Das

Subjects: Methodology (stat.ME); Computers and Society (cs.CY)

Machine-learning systems used in survey-based social measurement require uncertainty estimates that are reliable across population subgroups, not merely valid in aggregate. We study ordinal conformal prediction for five-level AI-attitude forecasting on the Pew American Trends Panel (Wave 152; n = 4,591; 12 race x education subgroups), comparing standard split conformal, Mondrian (group-specific) conformal, and a regularized Mondrian comparator across 100 respondent-disjoint splits with survey-weighted evaluation. Standard conformal achieves nominal marginal coverage for all four base predictors but leaves weighted subgroup gaps of ~13 percentage points. For the strongest predictor (XGBoost), Mondrian worsens the fairness-efficiency trade-off: weighted set size rises by +0.036 (dz = 1.66) while the weighted subgroup gap grows by +0.013 (dz = 0.30). A regularized comparator that shrinks group thresholds toward the global quantile mitigates this instability (Delta gap = -0.001, Delta size = +0.012) but does not yield a decisive fairness gain. Failure analysis traces the mechanism to calibration-cell fragmentation interacting with group-specific confidence mismatch. The negative result persists across alternate outcome codings and subgroup granularities, demonstrating that nominal marginal validity is insufficient for subgroup reliability and that naive group-specific calibration is not a dependable fairness remedy in complex survey settings.
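The difference between the standard split conformal and Mondrian calibration schemes compared above comes down to where the score quantile is computed: once over the whole calibration set, or separately within each group. The sketch below illustrates that difference on unweighted toy scores. It is a minimal illustration only; it omits the survey weights, the ordinal nonconformity score, and the regularized comparator used in the paper, and all names and data are hypothetical.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def mondrian_thresholds(scores, groups, alpha):
    """Group-specific (Mondrian) thresholds: one conformal quantile per group."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    return {g: conformal_threshold(v, alpha) for g, v in by_group.items()}


# Hypothetical calibration scores and subgroup labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
groups = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

global_q = conformal_threshold(scores, alpha=0.5)
group_q = mondrian_thresholds(scores, groups, alpha=0.5)
```

Each group threshold here rests on only four or five scores, and a group with too few scores gets an infinite threshold; that sensitivity to small calibration cells is the kind of fragmentation the abstract points to.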
[19] arXiv:2605.05568: Title: Relaxed Sparsest-Permutation Formulation for Causal Discovery at Scale

Authors: Sunmin Oh, Sang-Yun Oh, Gunwoong Park

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Despite the growing availability of large datasets, causal structure learning remains computationally prohibitive at scale. We revisit sparsest-permutation learning for linear structural equation models and show that exact Cholesky factorization is unnecessary for structure recovery. This observation motivates a support-level relaxation that searches for sparse triangular factors over a precision-support screening graph. The relaxed formulation can be efficiently evaluated via masked zero-fill incomplete Cholesky factorization, enabling scalable comparison of candidate orderings. At the population level, we establish soundness for Markov equivalence class (MEC) recovery under no-cancellation and sparsest Markov representation assumptions, as well as robustness to ordering misspecification. Motivated by these guarantees, we introduce SCOPE, a sparse-Cholesky pipeline that provides a scalable implementation of the relaxed formulation. Experiments on synthetic and real datasets demonstrate that SCOPE matches the MEC recovery accuracy of substantially slower baselines, while achieving significantly reduced runtime and scaling to 10k variables.
[20] arXiv:2605.05591: Title: In-Context Positive-Unlabeled Learning

Authors: Siyan Liu, Yi Chang, Manli Cheng, Qinglong Tian, Pengfei Li

Comments: 12 pages, 1 figure, 3 tables

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning. PUICL is pretrained on synthetic PU datasets generated from randomly instantiated structural causal models, exposing it to a wide range of feature-label relationships and class-prior configurations. At inference time, PUICL receives the labeled positives and the unlabeled samples as a single input and returns class probabilities for the unlabeled rows in one forward pass, with no gradient updates or per-task fitting. On 20 semi-synthetic PU benchmarks derived from the UCI Machine Learning Repository, OpenML, and scikit-learn, PUICL outperforms four standard PU learning baselines in average AUC and accuracy, and is competitive on F1-score. These results show that the in-context learning paradigm extends naturally beyond fully supervised tabular prediction to the semi-supervised PU setting.
[21] arXiv:2605.05595 [pdf, ps, other]: Title: Bayesian Multi-Topology Express Transportation Network Design under Posterior Predictive Demand, Sorting-Efficiency and Delivery-Time Uncertainty

Authors: Debashis Chatterjee

Subjects: Other Statistics (stat.OT)

Express transportation network design is uncertain because origin--destination demand, travel time, operating cost, hub congestion, and realized sorting productivity vary over time. Existing multi-topology express network models usually optimize cost and maximum arrival time under fixed input data, which may produce designs that are efficient nominally but fragile under demand surges, route disruptions, and hub productivity losses. This paper develops a Bayesian posterior-predictive framework for multi-topology express transportation network design. The model learns demand, travel-time, cost, and hub-reliability uncertainty from historical or benchmark-calibrated data and propagates them through posterior predictive scenarios. For fully connected, hub-and-spoke, restricted-allocation, and direct-link hybrid topologies, candidate designs are evaluated using posterior expected cost, conditional value-at-risk of maximum arrival time, service reliability, hub hold-time reliability, and emission-aware penalties. A Bayesian multi-structure design methodology is proposed using posterior simulation, sample-average approximation, topology-wise optimization, and Bayes-risk selection. Theoretical results establish existence of a Bayes-optimal design, convergence of posterior scenario risks, and stability of topology selection. Simulation and CAB benchmark experiments show that the Bayesian design can trade modest additional cost for substantial reductions in tail delivery risk and improved hub reliability.
[22] arXiv:2605.05606 [pdf, ps, other]: Title: Variational Smoothing and Inference for SDEs from Sparse Data with Dynamic Neural Flows

Authors: Yu Wang, Arnab Ganguly

Comments: Yu Wang and Arnab Ganguly contributed equally to this work. Corresponding author: Arnab Ganguly

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Stochastic differential equations (SDEs) provide a flexible framework for modeling temporal dynamics in partially observed systems. A central task is to calibrate such models from data, which requires inferring latent trajectories and parameters from sparse, noisy observations. Classical smoothing methods for this problem are often limited by path degeneracy and poor scalability. In this work, we develop a novel method based on a characterization of the posterior SDE in terms of a conditional backward-in-time score, defined as the gradient of a function solving a Kolmogorov backward equation with multiplicative updates at observation times. We learn this conditional score using neural networks trained to satisfy both the governing PDE and the observation-induced jump conditions, thereby integrating continuous-time dynamics with discrete Bayesian updates. The resulting score induces a posterior SDE with the same diffusion coefficient but a modified drift, enabling efficient posterior trajectory sampling. We further derive a likelihood-based objective for learning the SDE parameters, yielding an evidence lower bound (ELBO) for joint state smoothing and parameter estimation. This leads to a variational EM-style procedure, where the neural conditional score is optimized to approximate the smoothing distribution, followed by a maximization step over the SDE parameters using samples from the induced posterior. Experiments on nonlinear systems demonstrate accurate and stable inference with very few observations and significantly improved scalability compared to classical MCMC methods.
[23] arXiv:2605.05629 [pdf, ps, other]: Title: Spherical Flows for Sampling Categorical Data

Authors: Jannis Chemseddine, Gregor Kornhardt, Gabriele Steidl

Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)

We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
[24] arXiv:2605.05666 [pdf, ps, other]: Title: Causal Inference of Blood Pressure Reduction and Coronary Heart Disease Risk in the Framingham Study

Authors: Suchibrata Patra

Comments: 13 pages, 5 figures. Submitted manuscript

Subjects: Applications (stat.AP); Methodology (stat.ME)

Standard cardiovascular risk calculators, including the Framingham Risk Score and the ACC/AHA Pooled Cohort Equations, estimate the conditional probability P(CHD | SysBP = s) rather than the interventional quantity P(CHD | do(SysBP = s)). When confounding is present, this distinction has direct clinical consequences: observational estimates may systematically overstate the absolute benefit of antihypertensive treatment. We applied Pearl's do-calculus to the Framingham Heart Study Offspring Cohort (n = 4,240; primary analysis on 3,776 complete cases; 574 ten-year coronary heart disease events). A structurally corrected directed acyclic graph (DAG) was specified and evaluated using conditional independence testing. The average causal effect (ACE) of a 20 mmHg systolic blood pressure reduction was estimated by g-computation with bootstrap confidence intervals, corroborated by propensity score matching and inverse probability weighting. G-computation yielded an ACE of 3.40 percent absolute risk reduction (95 percent CI: 2.64 to 4.14), compared with a naive observational estimate of 4.14 percent, corresponding to an approximate 21.8 percent relative overestimation. Conditional average treatment effects were estimated using R-Learner and T-Learner metalearners. These findings suggest that observational cardiovascular risk tools may overestimate the absolute benefit of blood pressure reduction, with implications for clinical risk stratification and prescribing thresholds.
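The relative overestimation quoted above follows directly from the two reported absolute risk reductions. As a worked check, using only the figures stated in the abstract:

```latex
\[
\frac{\text{naive} - \text{ACE}}{\text{ACE}}
  = \frac{4.14 - 3.40}{3.40}
  = \frac{0.74}{3.40}
  \approx 0.218 ,
\]
```

that is, the observational estimate exceeds the g-computation estimate by roughly 21.8 percent in relative terms.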
[25] arXiv:2605.05683 [pdf, ps, other]: Title: Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

Authors: Andy Zeyi Liu, Elliot Paquette, John Sous

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.
[26] arXiv:2605.05684 [pdf, ps, other]: Title: Latent Impact and Differential Item Functioning Analysis for Asymmetric IRT Models

Authors: Gabriel Wallin, Qi Huang

Subjects: Methodology (stat.ME)

Differential item functioning (DIF) arises alongside latent population heterogeneity in many applications, and both must be accounted for when assessing measurement invariance. In many practical settings, however, the comparison groups are unobserved and anchor items are unknown. A further challenge is that item response theory models traditionally assume symmetric link functions, yet empirical response processes may exhibit substantial asymmetry. This paper proposes a general framework for jointly analysing impact and DIF under asymmetric item response models. Unobserved group differences are represented by latent classes within a mixture item response model, while item-specific shifts capture DIF effects. Assuming the number of DIF items is relatively small, an $\ell_1$-regularised estimator is used to simultaneously identify the latent classes and select DIF items without requiring observed group labels or pre-specified anchor items. A simulation study evaluates recovery of impact, item parameters, and DIF effects across a range of configurations. The method is illustrated using two empirical applications from educational testing. In one dataset, the selected model reveals both impact and item-level DIF, whereas in the other, the results indicate substantial impact but little evidence of item-level DIF.
[27] arXiv:2605.05743 [pdf, ps, other]: Title: Fourier Feature Methods for Nonlinear Causal Discovery: FFML Scoring and FFCI Testing in Mixed Data

Authors: Joseph D. Ramsey

Comments: 16 pages, 2 figures, 3 tables

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Gaussian process marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present two complementary RFF-based methods forming a practical toolkit for score-based, constraint-based, and hybrid causal discovery.
The Fourier Feature Marginal Likelihood (FFML) score approximates the exact GP marginal likelihood by replacing the n x n kernel Gram matrix with a finite-dimensional feature representation, reducing cost to O(nm^2 + m^3) while retaining the probabilistic interpretation and automatic complexity penalty of the exact score. FFML extends to mixed (continuous + discrete) parent sets via a product-kernel construction, with a Kronecker path for small discrete parent sets and a Hadamard-product path otherwise.
The Fourier Feature Conditional Independence (FFCI) test is a fast nonparametric CI test for mixed data. Each variable is featurized individually: continuous variables via RFF or Orthogonal Random Features (ORF), discrete variables via a Cholesky-factored categorical feature map, with blocks concatenated. Conditioning uses ridge residualization in feature space; the test statistic is a Frobenius norm of the residualized cross-covariance, approximated as a weighted sum of chi-squared variables.
Although FFML and FFCI share the same RFF/ORF machinery, they differ architecturally: FFML builds a joint kernel over a parent set for scoring, while FFCI featurizes variables individually for testing. We compare FFML to TRFF, a penalized Student-t regression alternative. Empirically, BOSS+FFML outperforms linear and kernel-ridge baselines on nonlinear data. When run through the same PC-Max implementation, FFCI and RCIT exhibit complementary precision-recall profiles: RCIT is more precise while FFCI achieves better recall and lower SHD, and runs in one third the time.
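The cost reduction described for FFML rests on a standard idea: replace the n x n Gram matrix by an n x m random feature matrix and do the linear algebra in feature space. The following is a minimal sketch of that idea only, assuming an RBF kernel, Gaussian noise, and fixed hyperparameters. The function names (`rbf_gram`, `rff_features`, `feature_log_marginal`) are hypothetical, and this is not the paper's FFML score; it omits the mixed-data product kernel and any hyperparameter handling.

```python
import numpy as np

def rbf_gram(X, lengthscale=1.0):
    """Exact RBF Gram matrix, O(n^2) memory."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

def rff_features(X, m, lengthscale=1.0, seed=0):
    """Random Fourier features Z such that Z @ Z.T approximates the RBF Gram matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, m))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)             # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

def feature_log_marginal(Z, y, noise_var):
    """Gaussian log marginal likelihood of y under covariance Z Z^T + noise_var * I,
    computed with the Woodbury identity in O(n m^2 + m^3) instead of O(n^3)."""
    n, m = Z.shape
    A = Z.T @ Z + noise_var * np.eye(m)       # m x m system, O(n m^2) to form
    L = np.linalg.cholesky(A)                 # O(m^3)
    v = np.linalg.solve(L, Z.T @ y)           # v = L^{-1} Z^T y
    quad = (y @ y - v @ v) / noise_var        # y^T (Z Z^T + noise_var I)^{-1} y
    logdet = (n - m) * np.log(noise_var) + 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

# Illustrative data (arbitrary, not from the paper).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

K = rbf_gram(X)
Z = rff_features(X, m=2000)
max_err = np.max(np.abs(K - Z @ Z.T))         # entrywise kernel approximation error
score = feature_log_marginal(rff_features(X, m=100), y, noise_var=0.01)
print(max_err, score)
```

The number of features m trades approximation quality against cost; the m x m Cholesky step is what yields the O(nm^2 + m^3) complexity mentioned in the abstract.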
[28] arXiv:2605.05744 [pdf, ps, other]: Title: A Stein Characterization-type Omnibus Tests for the Discrete Pareto Distribution

Authors: Deepesh Bhati, Bruno Ebner, Sakshi Khandelwal

Comments: 24 pages, 4 tables, 2 figures

Subjects: Methodology (stat.ME); Statistics Theory (math.ST)

The discrete Pareto (or Zeta, Zipf) distribution arises naturally in modeling rank-frequency data across diverse fields such as linguistics, demography, biology, and computer science. Despite its widespread applicability, goodness-of-fit testing for the discrete Pareto distribution remains underdeveloped, particularly in the presence of heavy tails and infinite support. This article introduces a novel goodness-of-fit test based on a new Stein-type characterization of the discrete Pareto distribution, formulated using its probability generating function. The proposed method is applicable even when the shape parameter is unknown and avoids binning or smoothing techniques. We study the asymptotic properties of the test and assess its empirical size and power through extensive simulation experiments. The results show that the proposed test either outperforms or matches the performance of existing methods across various alternatives. Applications to real datasets are provided to demonstrate its practical relevance and robustness.
[29] arXiv:2605.05752 [pdf, ps, other]: Title: Generative AI-Based Monte Carlo Simulation for Method Evaluation Using Synthetic Multilevel Data

Authors: Youmi Suk, Chenguang Pan, Weixuan Xiao

Comments: 31 pages for the main text

Subjects: Methodology (stat.ME); Applications (stat.AP)

The role of AI-generated synthetic data has recently been expanded to support realistic Monte Carlo simulations. However, guidance is limited on generating data with multilevel structures and designing simulations based on such data. This study proposes a general framework for AI-based simulation studies to evaluate the predictive performance and parameter recovery of quantitative methods, specifically using multilevel data commonly observed in the social sciences. Our proposed six-stage workflow consists of (i) specifying a method and real data, (ii) training Generative AI with real data, (iii) assessing synthetic data quality, (iv) designing and conducting simulations, (v) evaluating method performance, and (vi) checking robustness. To enhance fidelity in multilevel data generation, we also introduce targeted modifications to diffusion models and Generative Adversarial Networks (GANs). Furthermore, we develop a systematic quality evaluation framework that assesses both within-table and between-table fidelity, and discuss how AI-based simulation designs should differ depending on whether the simulation's objective is predictive performance or parameter recovery. Finally, using empirical multilevel data and multilevel modeling methods, we demonstrate the utility of the proposed AI-based simulation framework. This approach leads to more accurate and honest evaluations of quantitative methods in the real world, unlike traditional simulation studies based on arbitrary simulated scenarios.
[30] arXiv:2605.05755 [pdf, ps, other]: Title: Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

Authors: Haodong Liang, Lifeng Lai

Comments: 25 pages, 4 figures

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.
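For readers unfamiliar with the update rule that the transformer block is said to implement, the following is a minimal sketch of the textbook semi-gradient SARSA step with linear action-value features. It is not the paper's attention-based construction; the function name `sarsa_update`, the feature vectors, and the step-size and discount values are illustrative assumptions.

```python
import numpy as np

def sarsa_update(w, phi, reward, phi_next, alpha=0.1, gamma=0.9, terminal=False):
    """One semi-gradient SARSA step for a linear action-value estimate q(s, a) = w @ phi(s, a).

    phi      : features of the current state-action pair
    phi_next : features of the next state-action pair actually taken (on-policy)
    """
    # Bootstrapped target; the gradient is NOT taken through it ("semi-gradient").
    target = reward + (0.0 if terminal else gamma * (w @ phi_next))
    td_error = target - w @ phi
    return w + alpha * td_error * phi

# Tiny illustration with one-hot features (hypothetical values).
w = np.zeros(3)
phi = np.array([1.0, 0.0, 0.0])
phi_next = np.array([0.0, 1.0, 0.0])
w = sarsa_update(w, phi, reward=1.0, phi_next=phi_next)
print(w)
```

Only the weight along the visited feature moves, in proportion to the temporal-difference error, which is the structure the abstract's explicit parameter constructions are described as reproducing in context.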
[31] arXiv:2605.05768 [pdf, ps, other]: Title: Optimal Confidence Band for Kernel Gradient Flow Estimator

Authors: Yuqian Cheng, Zhuo Chen, Qian Lin

Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regression literature, we first establish convergence rates for the supremum-norm generalization error of both continuous and discrete kernel gradient flows under the source condition $s>\alpha_0$, where $\alpha_0\in(0,1)$ denotes the embedding index of the kernel function. Moreover, we show that these rates match the minimax optimal rates. Building on this result, we then construct simultaneous confidence bands for both continuous and discrete kernel gradient flows. Notably, the widths of the proposed confidence bands are also optimal, in the sense that their shrinkage rates are greater than, yet can be arbitrarily close to, the minimax optimal rates.
[32] arXiv:2605.05772 [pdf, ps, other]: Title: UD-DML: Uniform Design Subsampling for Double Machine Learning over Massive Data

Authors: Yuanke Qu, Xiaoya Xu, Hengtao Zhang

Subjects: Methodology (stat.ME)

Double machine learning (DML) delivers valid inference on low-dimensional causal parameters while permitting flexible nuisance estimation, but its computational cost becomes prohibitive once cross-fitted learners must be trained on massive observational data. Applying DML to a uniformly drawn subsample alleviates this burden, yet such a reduction disregards the geometry of the covariate space and can exacerbate treated-control imbalance as well as overlap deficiency. We propose Uniform Design Double Machine Learning (UD-DML), a design-based subsampling strategy for average treatment effect (ATE) estimation. UD-DML first constructs a low-discrepancy skeleton in a PCA-rotated covariate space under the mixture-discrepancy criterion, and then assigns, to each skeleton point, the nearest treated and control units via KD-tree search. The resulting matched subsample is, by construction, both representative of the full covariate distribution and balanced across treatment arms; cross-fitted DML is subsequently applied to it. We establish discrepancy-based guarantees for representativeness and balance, and prove that the UD-DML estimator is $\sqrt{r}$-asymptotically normal under mild conditions, where the selected subsample size $r \ll n$. The dominant nuisance-fitting cost is thereby reduced from the $n$-scale to the $r$-scale. Monte Carlo experiments show that UD-DML attains lower RMSE, narrower confidence intervals and more reliable coverage than uniform subsampling, with the largest gains in low-overlap and misspecified regimes. An application to a large observational dataset further demonstrates its practical feasibility.
[33] arXiv:2605.05798 [pdf, ps, other]: Title: Dual-Homotopy Framework for Constrained EM Algorithm

Authors: Jisoo Choi, Hee-Seok Oh

Subjects: Methodology (stat.ME)

We propose a new constrained EM algorithm that is applicable to general constrained estimation problems. The proposed method is based on a novel framework, the "dual-homotopy framework," which combines deterministic annealing EM with a barrier-based optimization, enabling stable estimation under parameter constraints. Building on this framework, we further introduce an adaptive constrained EM algorithm that preserves likelihood monotonicity, regardless of the underlying distributional form or the specific structure of the constraints. Through simulation studies and a real-data analysis, both under parameter constraints, we demonstrate that the proposed algorithm yields more stable and accurate estimates than existing methods, including the standard EM algorithm.
[34] arXiv:2605.05808 [pdf, ps, other]: Title: Ratio-based Loss Functions

Authors: Lena Helgerth, Andreas Christmann

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Algorithms in machine learning and AI critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probability measures that are allowed for the specified algorithm. This paper gives a survey of a certain class of loss functions, which we call ratio-based. In supervised learning, margin-based loss functions for classification tasks, which depend on the product of the output values $y_i$ and the predictions $f(x_i)$, and distance-based loss functions for regression, which depend on the difference of $y_i$ and $f(x_i)$, are common. Distance-based loss functions are particularly useful if an additive model assumption seems plausible, i.e. the common signal plus noise assumption. However, in the literature, several loss functions proposed for regression purposes have a multiplicative error structure in mind and pay attention to relative errors, i.e. to the ratio of $y_i$ and $f(x_i)$. In this survey article, we systematically investigate such ratio-based loss functions and propose a few new losses, which may be interesting for future research. We concentrate on investigating general properties of ratio-based loss functions like continuity, Lipschitz-continuity, convexity, and differentiability, because these properties play a central role in most machine learning algorithms. Therefore, we do not focus on some specific machine learning algorithm to derive universal consistency, learning rates, or stability results. Instead, we want to enable future research in this direction.
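The three families of losses distinguished above differ in which combination of $y_i$ and $f(x_i)$ they depend on. The sketch below contrasts one familiar member of each family. The hinge and squared losses are standard; the squared log-ratio loss is chosen here purely as an illustrative ratio-based example and is not claimed to be one of the losses surveyed or proposed in the paper.

```python
import numpy as np

def hinge_loss(y, f):
    """Margin-based: depends on the product y * f, with labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f)

def squared_loss(y, f):
    """Distance-based: depends on the difference y - f (additive error view)."""
    return (y - f) ** 2

def squared_log_ratio_loss(y, f):
    """Ratio-based (illustrative choice): depends on the ratio f / y, for y, f > 0."""
    return np.log(f / y) ** 2

# The same 20 percent relative error at two different scales.
small = (10.0, 12.0)
large = (100.0, 120.0)

print(squared_loss(*small), squared_loss(*large))                      # grows with scale
print(squared_log_ratio_loss(*small), squared_log_ratio_loss(*large))  # scale-invariant
```

Under a multiplicative error structure the ratio-based loss treats both predictions as equally wrong, whereas the distance-based loss penalizes the larger-scale case far more heavily.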
[35] arXiv:2605.05809 [pdf, ps, other]: Title: Detecting Changes in Causal Dependence with Kernels and Copulas

Authors: Shakeel Gavioli-Akilagun, Kieran Wood, Francesco Quinzan

Comments: 34 pages, 5 figures

Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

We propose a framework for determining whether the causal dependence of an outcome $Y$ on a covariate $X$ changes at a given time point, given confounders $\boldsymbol{Z}$. For instance, in financial markets, the effect of a market indicator on asset returns may causally change over time. While many existing measures of association can be used to detect changes in joint and marginal distributions, in the absence of strong assumptions on the data generating process none are suitable for detecting changes in the causal mechanism or in the strength of the causal relationship. In this work we approach the problem from a fully non-parametric perspective, and treat the causal mechanism as well as the distribution of the data as unknown. We introduce a quantity based on the integrated difference between kernel mean embeddings of certain conditional copulas, which is provably equal to zero if the causal dependence does not change and strictly positive otherwise. A near-linear time estimator for the quantity is proposed, with rates of convergence explicitly spelled out. Extensive experiments demonstrate that the proposed statistic achieves high accuracy on multiple synthetic and real-world datasets. We additionally show how the proposed statistic can be used for change point detection when the goal is to detect changes in causal dependence occurring at unknown times.
[36] arXiv:2605.05859 [pdf, ps, other]: Title: Estimation of treatment effects in presence of differential use of post-randomization concomitant medication with time-to-event outcomes

Authors: Helene C. W. Rytgaard, Edwin Fong, Jens M. Tarp, Thomas A. Gerds, Mark J. van der Laan, Henrik Ravn

Subjects: Methodology (stat.ME)

In placebo-controlled randomized trials, the post-randomization use of concomitant medications may be higher in the placebo arm than in the treatment arm. This may dilute the full benefits of the randomized drug as estimated by the intention-to-treat analysis. We focus on cardiovascular outcomes trials of glucose-lowering treatments in type-2 diabetes patients, where patients in the placebo arm are more likely to add other glucose-lowering agents with established cardio-protective properties. As a supplement to the intention-to-treat analysis, we propose a class of estimands within a causal framework that isolates the specific impact of the treatment being studied from that of concomitant treatment use. These estimands are defined under time-dependent treatment interventions to balance exposure to additional medications across intervention arms. We advocate for specific stochastic interventions to achieve this balance while minimizing positivity violations, which arise when certain treatment combinations or characteristics are not sufficiently represented in the data. We employ targeted minimum loss-based estimation (TMLE) to optimize the estimation procedure for our estimands while allowing for flexible adjustments for time-dependent covariates from follow-up visits. Finally, we demonstrate the application of the methods through a simulation study and a real-world example from the LEADER cardiovascular outcomes trial, which assessed cardiovascular risk for liraglutide versus placebo.
[37] arXiv:2605.05873 [pdf, ps, other]: Title: CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

Authors: Hirofumi Ota, Naoto Iwase, Yuki Ichihara, Junpei Komiyama, Masaaki Imaizumi

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible answers is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model's response distribution, a guarantee distinct from answer correctness. We propose the Certification by Intersection-union Testing with E-processes (CITE) algorithm, which provably controls false certification at any prescribed level under arbitrary data-driven stopping, without requiring prior knowledge of the answer category set. We also prove a category-set-size-free stopping-time rate, establish matching minimax lower bounds up to constants in the main regime, and extend the construction to confidence-weighted voting. Simulations and LLM self-consistency experiments show empirical error control and improved certification in diffuse-tail settings.
[38] arXiv:2605.05882 [pdf, ps, other]: Title: Tuning Derivatives for Causal Fairness in Machine Learning

Authors: Filip Edström, Guilherme W. F. Barros, Tetiana Gorbach, Xavier de Luna

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Artificial-intelligence systems are becoming ubiquitous in society, yet their predictions typically inherit biases with respect to protected attributes such as race, gender, or age. Classical fairness notions, most notably Statistical Parity (SP), demand that predictions be independent of the protected attributes, but are overly restrictive when these attributes influence mediating variables that are considered business necessities. Recent causal formulations relax SP by distinguishing allowed from not-allowed causal paths and by complementing SP with Predictive Parity (PP), requiring the predictor to replicate the legitimate influence of business-necessities. Existing path-based definitions are mainly practical when applied to categorical attributes. This paper introduces a new framework for fairness in structural causal models that is tailored to continuous protected attributes. We formalize SP and PP through path-specific partial derivatives, establish conditions under which these criteria coincide with prior causal definitions, and characterize when a fair predictor, one that satisfies SP along not-allowed paths while achieving PP along allowed paths, exists. Building on this theory, we propose a fair tuning algorithm that either constructs such a predictor or, when not possible, allows for a trade-off between SP and PP. We present experiments on simulated and real data to evaluate our proposal, compare it with previously proposed methods, and show that it performs better when PP is considered.
[39] arXiv:2605.05923 [pdf, ps, other]: Title: Joint modelling of time-dependent biomarker variability and time-to-event outcomes, a two-step approach

Authors: Felix Boakye Oppong, Dimitris Rizopoulos, Thierry Gorlia, Nicole Erler

Subjects: Methodology (stat.ME)

Increasing evidence suggests that variability in longitudinal biomarkers, in addition to their mean trajectory, carries prognostic information for time-to-event outcomes. However, standard joint models typically capture only the expected value of the biomarker process, assuming constant residual variability across individuals and time. Fully joint extensions that model within-subject variability exist but are computationally demanding and require dedicated software packages. We propose a flexible two-step approach for incorporating biomarker variability into joint models. First, residuals (or their transformations) from a mixed-effects model are used to derive subject- and time-specific measures of variability. Second, these variability measures are included in a standard joint model, allowing their association with survival to be estimated alongside the mean biomarker trajectory. Our approach can also accommodate multiple biomarkers simultaneously and is readily implemented using existing joint modeling software without custom extensions. Through simulations, we show that our method provides reasonable performance for variability effects across a range of scenarios. We further illustrate our approach using longitudinal data of white blood cell counts from a large phase III glioblastoma trial, demonstrating that both mean levels and variability of hematological markers carry prognostic information for overall survival.
[40] arXiv:2605.05930 [pdf, ps, other]: Title: Toward design-based inference for data integration

Authors: Andrius Čiginas, Ieva Burakauskaitė, Jae Kwang Kim

Comments: 31 pages, 6 figures, 7 tables

Subjects: Methodology (stat.ME)

Integrating non-probability samples into finite-population inference typically requires modeling unknown selection probabilities under a missing-at-random (MAR) assumption that is difficult to verify. We propose a design-based alternative in which the non-probability sample is treated as a fully observed certainty stratum and a probability sample is drawn only from the complementary, previously unsampled units. Within this sequential framework, we develop two generalized regression estimators: one fitting the outcome model separately in the complementary stratum, the other pooling both samples; we make two distinct contributions. First, both estimators are design-consistent and admit consistent variance estimators with no assumption whatsoever on the non-probability selection mechanism, including under not-missing-at-random (NMAR) selection. Second, under a working superpopulation model that holds in both strata, the pilot non-probability sample can be used to construct second-stage inclusion probabilities that achieve Isaki-Fuller asymptotic optimality for the separate estimator; this optimality claim relies on assumptions strictly stronger than MAR, but its failure does not invalidate the consistency results above. A diagnostic test for coefficient homogeneity is proposed to guide the choice between the two estimators. Simulations confirm that the sequential estimators remain essentially unbiased under both MAR and NMAR, while propensity-adjusted competitors can be severely biased under NMAR. Two applications from Lithuanian official statistics illustrate that separate regression is preferable when the pilot stratum and its complement are strongly heterogeneous, whereas combined regression offers a modest efficiency gain when the two strata are similar.
[41] arXiv:2605.05973 [pdf, ps, other]: Title: Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Authors: Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao, Vaneet Aggarwal

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)

Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
[42] arXiv:2605.05984 [pdf, ps, other]: Title: Separable Effects in Four-Arm and Two-Arm Designs

Authors: Chan Park, Youmi Suk

Subjects: Methodology (stat.ME)

Robins and Richardson (2010) reformulated mediation analysis by decomposing treatments into multiple components and examining separable effects of each component. While this approach is increasingly popular, existing work has analyzed "two-arm" data, where components are strictly bundled and manipulated simultaneously. However, in practice, four-arm data where components are assigned independently are often available. For example, testing accommodations might strictly bundle extra time with a separate session or allow them to be assigned separately. To address this distinction, we propose a general framework for analyzing separable effects in four-arm and two-arm designs. This framework provides distinct identification and estimation strategies for each design. For estimation, we utilize efficient influence function estimators coupled with machine learning and cross-fitting techniques. Additionally, we introduce two falsification tests for key identification assumptions required in the two-arm design by leveraging four-arm data. We investigate the performance of the proposed estimators via a simulation study and demonstrate their application by studying the effect of extended time accommodations using data from the National Assessment of Educational Progress. Ultimately, this separable effects analysis enables practitioners to clearly communicate underlying mechanisms and derive informative policy recommendations.
[43] arXiv:2605.05993 [pdf, ps, other]: Title: TabCF: Distributional Control Function Estimation with Tabular Foundation Models

Authors: Geping Chen, Chunlin Li, Tianzhong Yang, Zhengyuan Zhu, Jing Zhou

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)

Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula-based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small- to medium-sized synthetic and real data scenarios. The central message is two-fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at https://github.com/GepingChen/TabCF.
[44] arXiv:2605.05996 [pdf, ps, other]: Title: Gaussian mixture models in Hilbert spaces via kernel methods

Authors: Daniel López-Montero, Antonio Álvarez-López, Marcos Matabuena

Comments: 38 pages, 13 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.
[45] arXiv:2605.06059 [pdf, ps, other]: Title: Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models

Authors: Jose Benitez-Aurioles, Ricardo Silva, Brian McMillan, Matthew Sperrin

Comments: 4 figures, 2 tables, 4 supplementaries

Subjects: Applications (stat.AP); Machine Learning (cs.LG)

In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity, may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation.
This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records.
In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
[46] arXiv:2605.06091 [pdf, ps, other]: Title: Time-Inhomogeneous Preconditioned Langevin Dynamics

Authors: Alexander Falk, Laurenz Nagler, Andreas Habring, Thomas Pock

Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)

Langevin sampling from distributions of the form $p(x) \propto \exp(-\Psi(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential $\Psi$ exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of $\Psi$. However, existing preconditioner choices inherently involve a trade-off between global mode coverage and local mode exploration, and no prior method resolves both simultaneously. To overcome this limitation, we propose TIPreL, which introduces a time- and position-dependent preconditioner. This design effectively addresses both challenges mentioned above within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the existing state of the art by proving convergence under time- and space-dependent diffusion coefficients and only locally Lipschitz drifts, a setting that has not been covered by prior work. Finally, we experimentally compare TIPreL with competing preconditioning schemes on a two-dimensional, severely ill-posed example and on a Bayesian logistic regression task in higher dimensions, confirming the efficiency of the proposed method.
[47] arXiv:2605.06135 [pdf, ps, other]: Title: Linked-Tucker Factorized Individualized Regression for Paired Multivariate Categorical Outcomes

Authors: Arkaprava Roy, Jeremy T. Gaskins, Steven Levy, Somnath Datta

Subjects: Methodology (stat.ME); Applications (stat.AP)

We propose a joint individualized hurdle-ordinal regression model for paired zero-inflated ordinal outcomes with subject-specific, spatially varying, and time-varying covariate effects, motivated by the Iowa Fluoride Study (IFS). The two outcomes, dental caries and dental fluorosis, are measured repeatedly across ages at fine spatial resolution, yielding nested longitudinal data with substantial zero inflation, ordinality, and heterogeneity across individuals and locations. For each outcome, a hurdle component models disease presence, while a proportional-odds component models severity among positive observations. To parsimoniously represent the high-dimensional coefficient arrays, we introduce a linked Tucker tensor factorization. Shared subject-mode factors induce dependence between the caries and fluorosis coefficient tensors, while separate spatial factors accommodate the distinct measurement grids of tooth surfaces and tooth zones. A horseshoe prior on the core tensor elements encourages sparsity, and posterior computation is performed using the No-U-Turn Sampler in NumPyro. Population-level effect summaries are obtained by projecting individualized posterior linear predictors onto the design space, and Wasserstein barycenters aggregate these summaries across tooth locations and anatomical classes. Applied to the IFS, the model reveals spatially heterogeneous associations between early-life fluoride and dietary exposures and both outcomes. Fluoride exposure is associated with increased odds and severity of fluorosis, while soda intake consistently increases caries risk. These associations differ between presence and severity components and vary across tooth locations, ages, and subpopulations defined by prior caries status, highlighting the importance of the joint hurdle-ordinal framework for disentangling disease occurrence from disease progression in multilevel dental data.
[48] arXiv:2605.06168 [pdf, ps, other]: Title: Scalable model selection for count time series with structural breaks: application to solid-organ transplantation during and after COVID-19 in the USA and Italy

Authors: Tobia Filosi, Emiliano Ceccarelli, Emilio Porcu, Elena Del Sordo, Libia Lara-Carrion, Giuseppe Iuppa, Francesca Puoti, Silvia Trapani, Silvia Testa, Giovanna Jona Lasinio

Subjects: Applications (stat.AP)

Weekly healthcare activity data are typically non-negative counts with temporal dependence and occasional system-wide disruptions, settings in which Gaussian time-series models may be inadequate. Solid organ transplant (SOT) activity provides a representative case study of a count process affected by a large external shock. We analyse weekly SOT counts in the USA and Italy from 2014 to October 2024, stratified by donor type (deceased vs living) and organ (kidney and liver). We fit Poisson and negative-binomial count time-series models incorporating short-term dynamics, calendar effects (holiday weeks), and pre-specified pandemic-period level and/or slope indicators. Candidate specifications are screened within a pre-defined portfolio and selected using BIC within each training window. Forecasting performance is evaluated with an expanding-window design at horizons $h\in\{4,8,12\}$ weeks. Alongside RMSE, we report empirical coverage of nominal $95\%$ predictive intervals and interval widths to summarise calibration and forecast uncertainty. Across strata, selected models capture substantial pandemic-period deviations and varying post-period trajectories. Deceased-donor series are broadly consistent with a return towards pre-pandemic baselines in both countries, whereas the US living-donor series shows a more gradual convergence in this application. Within the explored model class and validation protocol, auxiliary covariates representing COVID burden and mortality add limited incremental predictive contribution beyond autoregressive and calendar components. Our analysis shows that donation time series represent an unconditional phenomenon, with auxiliary variables having a statistically negligible impact on donations, thus allowing a focus on more practical aspects related to ongoing challenges in the post-pandemic era, such as hospital overloads and changes in public perception.
[49] arXiv:2605.06172 [pdf, ps, other]: Title: Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective

Authors: Meira Iske, Carola-Bibiane Schönlieb

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)

Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are $L^1$-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.
[50] arXiv:2605.06204 [pdf, ps, other]: Title: When Does Trimming Help Conformal Prediction? A Retained-Law Diagnostic under Calibration Contamination

Authors: Congye Wang

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. This separates a clean-side covariance cost from a retained-contamination cost, governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.
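The conditioning view of trimming can be illustrated with a minimal split-conformal sketch. It is not the paper's procedure: the function names, the `cutoff` parameter, and the toy scores are all hypothetical, and only the standard split-conformal quantile rule is assumed. The point it shows is that the threshold after trimming is computed from the retained calibration scores, so it reflects the retained law rather than a purified clean one.

```python
import math

def conformal_threshold(scores, alpha):
    """Standard split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def trimmed_conformal_threshold(scores, anomaly, cutoff, alpha):
    """Drop calibration points whose anomaly score exceeds `cutoff` (hypothetical
    fixed-threshold rule), then compute the threshold on the retained points."""
    kept = [s for s, a in zip(scores, anomaly) if a <= cutoff]
    return conformal_threshold(kept, alpha)

# Toy calibration set: the two largest nonconformity scores are flagged as anomalous.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
anomaly = [0.0] * 8 + [1.0, 1.0]

untrimmed = conformal_threshold(scores, alpha=0.2)
trimmed = trimmed_conformal_threshold(scores, anomaly, cutoff=0.5, alpha=0.2)
```

In this toy case trimming lowers the threshold because the flagged points also carry the largest nonconformity scores; if the anomaly score were unrelated to the nonconformity score, the retained quantile would move much less.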
[51] arXiv:2605.06210 [pdf, ps, other]: Title: Super-Level-Set Regression: Conditional Quantiles via Volume Minimization

Authors: Sacha Braun, Michael I. Jordan, Francis Bach

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires minimizing a volume objective that is coupled with the conditional quantiles of the model's own estimation error. In this work, we address this challenge. We introduce super-level-set regression (SLS), a novel mathematical framework that successfully resolves this implicit coupling, allowing us to directly parameterize and optimize the geometric boundaries of the target conditional level sets. By bypassing full distribution estimation and leveraging flexible volume-preserving frontier functions, our approach natively captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.
[52] arXiv:2605.06236 [pdf, ps, other]: Title: A Two-Level Plackett-Luce Model for preference modeling in smart mobility platforms

Authors: M. Santos-Pascual, D. Ríos Insua, P. Angulo

Comments: Preprint version, 20 pages, 10 figures

Subjects: Applications (stat.AP); Methodology (stat.ME)

The Plackett-Luce model is widely used to deal with probabilities in discrete choice settings. This paper introduces a novel two-level Plackett-Luce model combined with a multinomial logistic scheme that provides the basis for the route choice module in a smart mobility platform. For this, we develop Bayesian inference and prediction mechanisms to capture consumers' preferences for personalized route recommendations. The model is empirically tested, allowing for refinements and discussion of its applicability. We also illustrate its practical relevance through several use cases, including relevant route selection, coordinated car pooling, incentive design and synthetic data generation.
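For readers unfamiliar with the base model, the standard single-level Plackett-Luce ranking probability can be sketched as follows. This is only the textbook model, not the two-level extension or the Bayesian inference scheme proposed in the paper; the route names and worth values are made up for illustration.

```python
from itertools import permutations

def plackett_luce_prob(ranking, worth):
    """Probability of `ranking` (best first) under the standard Plackett-Luce model.

    At each stage the next item is chosen with probability proportional to its
    positive worth among the items not yet ranked.
    """
    remaining = sum(worth[item] for item in ranking)
    prob = 1.0
    for item in ranking:
        prob *= worth[item] / remaining
        remaining -= worth[item]
    return prob

# Hypothetical worth parameters for three candidate routes.
worth = {"route_a": 3.0, "route_b": 2.0, "route_c": 1.0}

p_abc = plackett_luce_prob(("route_a", "route_b", "route_c"), worth)
total = sum(plackett_luce_prob(r, worth) for r in permutations(worth))
```

Here `p_abc` equals (3/6) * (2/3) * 1 = 1/3, and the probabilities over all six orderings sum to one.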
[53] arXiv:2605.06237 [pdf, ps, other]: Title: Bayesian Fractional Polynomials for Optimal Dosage Estimation with Fish Nutrition Applications

Authors: Aliaksandr Hubin, Åshild Krogdahl, Guro Løkka, Trond M. Kortner

Comments: 6 pages, 3 figures. Accepted as a long paper to IWSM 2026

Subjects: Methodology (stat.ME); Applications (stat.AP)

The problem of optimal dosage estimation arises in diverse scientific domains, from pharmacology and toxicology to aquaculture and environmental studies. Statistical modeling of nonlinear dose-response relationships is essential to quantify biological effects and determine response-optimal levels. This paper introduces a flexible Bayesian fractional polynomial (BFP) framework for modeling such relationships, allowing for model uncertainty quantification and robust prediction through Bayesian model averaging. Extensive simulation results demonstrate that the proposed BFP approach yields accurate estimation of optimal dose levels, outperforming benchmarks significantly. The approach is demonstrated on real data from fish nutrient requirement experiments.
[54] arXiv:2605.06265 [pdf, ps, other]: Title: ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees

Authors: Tianpai Luo, Fangwei Wu, Weichi Wu

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of convolution-smoothed quantile ReLU neural networks, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we demonstrate that the proposed approach outperforms standard quantile neural networks at multiple quantile levels, showing improved estimation accuracy and training efficiency across the board, with particularly pronounced advantages at high and low quantiles.
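The idea of smoothing the pinball loss by convolution can be sketched in a few lines. The abstract does not say which kernel or bandwidth ConquerNet uses, so the Gaussian kernel and the bandwidth `h` below are assumptions; the closed form follows from writing the pinball loss as |u|/2 + (tau - 1/2)u and convolving |u| with a N(0, h^2) density.

```python
import math

def pinball_loss(u, tau):
    """Non-smooth check (pinball) loss at residual u and quantile level tau."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def smoothed_pinball_loss(u, tau, h):
    """Pinball loss convolved with a Gaussian kernel of bandwidth h > 0 (assumed kernel)."""
    gauss = (h / 2.0) * math.sqrt(2.0 / math.pi) * math.exp(-u * u / (2.0 * h * h))
    return gauss + 0.5 * u * (1.0 - 2.0 * normal_cdf(-u / h)) + (tau - 0.5) * u

# The smoothed loss is differentiable at 0 and approaches the pinball loss as h shrinks.
at_zero = smoothed_pinball_loss(0.0, tau=0.9, h=1.0)
near_exact = smoothed_pinball_loss(-2.0, tau=0.9, h=0.01)
```

Because the pinball loss is convex and the kernel is symmetric, the smoothed loss never falls below the original, and the gap vanishes as `h` goes to zero.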
[55] arXiv:2605.06288 [pdf, ps, other]: Title: A Topological Sorting Criterion for Random Causal Directed Acyclic Graphs

Authors: Alexander G. Reisach, Antoine Chambaz, Gilles Blanchard, Sebastian Weichwald

Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI)

Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale-free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simulations in the literature feature settings where this yields an excellent proxy for the causal order, and show that a strict increase of relatives along the causal order leads to a singular Markov equivalence class. We propose sampling time-series DAGs as a possible alternative and discuss implications for causal discovery algorithms and their evaluation on synthetic data.
[56] arXiv:2605.06289 [pdf, ps, other]: Title: Multimodal Deep Generative Model for Semi-Supervised Learning under Class Imbalance

Authors: Heegeon Yoon, Heeyoung Kim

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential to leverage complementary modalities. In this article, we propose a multimodal deep generative model for semi-supervised learning under class imbalance. Our approach uses separate encoders for each modality, sharing latent variables across modalities, and simplifies joint posterior computation with a product-of-experts method. To further address class imbalance, we replace typical Gaussian distributions with Student's t-distributions for the prior, encoder, and decoder, better capturing the heavy-tailed latent distributions in imbalanced data. We derive a new objective function for training the proposed model on both labeled and unlabeled data using $\gamma$-power divergence. Empirical results on benchmark and real-world datasets demonstrate that our model outperforms baseline methods in generalization, achieving superior classification performance for partially labeled multimodal data with imbalanced class distributions.
[57] arXiv:2605.06315 [pdf, ps, other]: Title: End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems

Authors: Carles Balsells-Rodas, Zhengrui Xiang, Xavier Sumba, Yingzhen Li

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce $\Omega$SDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that $\Omega$SDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.
[58] arXiv:2605.06367 [pdf, ps, other]: Title: The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

Authors: Flavio Nicoletti, Chenxiao Ma, Enrico Ventura, Luca Saglietti, Stefano Sarao Mannelli

Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models, and potentially exacerbate disparities, remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.
[59] arXiv:2605.06373 [pdf, ps, other]: Title: Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $τ$-Mixing

Authors: Leon Halgryn (1), Sophie Langer (2), Janusz M. Meylahn (1), E. Moritz Hahn (1) ((1) University of Twente, (2) Ruhr-Universität Bochum)

Comments: 48 pages total. 6 figures; 3 tables

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $\tau$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $\tau$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $\tau$-mixing data. Moreover, we derive the sample complexity of DQN under $\tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
[60] arXiv:2605.06413 [pdf, ps, other]: Title: Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors

Authors: Richard Bergna, Stefan Depeweg, José Miguel Hernández-Lobato

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic-aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.
[61] arXiv:2605.06417 [pdf, ps, other]: Title: Minimax estimation of Functional Principal Components from noisy discretized functional data: the case of smooth processes

Authors: Nassim Bourarach, Franck Picard, Vincent Rivoirard, Angelina Roche

Subjects: Statistics Theory (math.ST)

We study the minimax estimation of covariance eigenfunctions and eigenvalues in functional principal component analysis when $n$ trajectories are observed at $p$ common grid points with additive noise. We consider covariance kernels with arbitrary Hölder smoothness and no prescribed parametric decay of the eigenvalues. In this setting, kernel smoothness and local spectral separation play distinct roles: a minimax inconsistency result over the smoothness-only class shows that kernel regularity alone is not sufficient for minimax-consistent eigenfunction estimation. To capture this interplay, we introduce a class of processes that jointly controls the Hölder smoothness of the covariance kernel and a local relative inverse eigengap quantity at the target index $\ell$. Over this class, we derive non-asymptotic minimax lower bounds for eigenfunction estimation that disentangle sampling variability, discretization and spectral effects, revealing rates of order $\delta_\ell n^{-1}+p^{-2\alpha}$, where $\delta_\ell$ quantifies the spectral difficulty. We also obtain non-asymptotic lower bounds for eigenvalue estimation under a relative squared-error loss. We then construct a computable wavelet projection estimator based on Coiflet scaling functions and a quadrature scheme designed to accommodate arbitrary Hölder smoothness. For eigenfunction estimation, this estimator matches the minimax dependence on the sample size and grid resolution, up to the natural spectral factor, for any Hölder index $\alpha>0$. Finally, we show that the proposed framework covers several classical Gaussian processes and Karhunen-Loève constructions. In particular, a Karhunen-Loève based criterion links spectral decay, eigenfunction regularity and covariance-kernel smoothness, and yields controlled simulation settings illustrating the predicted phase transitions and least-favourable discretization effects.
[62] arXiv:2605.06438 [pdf, ps, other]: Title: Neural-Actuarial Longevity Forecasting: Anchoring LSTMs for Explainable Risk Management

Authors: Davide Rindori

Comments: 26 pages, 12 figures. Code available at this https URL

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Risk Management (q-fin.RM)

Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non-linearities, we propose Hybrid-Lift, a neural-actuarial framework that combines Hierarchical LSTM networks with a Mean-Bias Correction (MBC) anchoring mechanism. Positioned as a governance-friendly model challenger rather than a replacement of classical approaches, the framework exhibits selective superiority on out-of-sample validation (2012-2020): it outperforms Li-Lee by 17.40% in Sweden and 12.57% in West Germany, while remaining comparable for near-linear regimes such as Switzerland and Japan. We complement the predictive model with an integrated governance suite comprising SHAP-based cross-country influence mapping, a dual uncertainty framework for regulatory capital calibration (Swiss ES 99.0% of +1.153 years), and a reverse stress test identifying the critical shock threshold for solvency buffer exhaustion. This research provides evidence that neural networks, when properly anchored by actuarial principles, can serve as effective model challengers for longevity risk management under the SST and Solvency II standards.
[63] arXiv:2605.06479 [pdf, ps, other]: Title: Risk-Controlled Post-Processing of Decision Policies

Authors: Sunay Joshi, Tao Wang, Hamed Hassani, Edgar Dobriban

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new policy that maximizes agreement with the baseline subject to a chance constraint on a user-specified loss. At the population level, we show that the optimal policy has a threshold structure: it follows the baseline except on contexts where switching to the oracle fallback policy yields a large reduction in conditional violation risk. At the finite-sample level, given a fitted fallback policy and score, we develop a post-processing algorithm that uses calibration data to select a threshold. Leveraging tools from algorithmic stability and stochastic processes, we show that under regularity conditions, in the i.i.d. setting, the expected excess risk of the post-processed policy is $O(\log n/n)$. In the special case when an exact-safe fallback policy is available, the algorithm achieves precise expected risk control under exchangeability. In this setting, we also give high-probability near-optimality guarantees on the post-processed policy. Experiments on a COVID-19 radiograph diagnosis task, an LLM routing problem, and a synthetic multiclass decision task show that targeted post-processing can meet or nearly meet risk budgets while preserving substantially more agreement with the baseline than score-blind random mixing.
[64] arXiv:2605.06484 [pdf, ps, other]: Title: Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts

Authors: Steven Wilkins-Reeves, Alexandra N. M. Darmon, Deeksha Sinha

Comments: 10 pages, 5 figures

Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more readily accessible observation for inference, the ultimate goal is to draw statistical inferences about the primary outcome parameter, and proxy data are typically imperfect in some ways. To correct for these imperfections, current statistical inference methods often depend on strict identifying assumptions (such as surrogacy, covariate/label shift, or missingness assumptions). These assumptions can be difficult to validate and may be violated by various additional sources of distribution shift, potentially leading to biased parameter estimates and miscalibrated uncertainty quantification. We introduce an estimate-level framework, inspired by domain adaptation techniques, to empirically calibrate proxy-based inference. This framework models the proxy-primary metric discrepancy as a random effect at the parameter level, estimating its distribution from aggregated historical observations across past domains (e.g., experiments, time periods, or distinct segments). This method avoids the requirement for retaining individual-level response data. Additionally, this adjustment can be layered on top of existing proxy-correction methods (such as prediction-powered inference or importance weighting) to account for additional biases not addressed by those corrections. To manage uncertainty when the number of historical domains is limited, we provide both a method-of-moments estimator and a domain bootstrap procedure. We further validate this approach using publicly available datasets and real-world experiments.
[65] arXiv:2605.06496 [pdf, ps, other]: Title: Bivariate Frank Copula: Some More Results on Point Estimation of the Association Parameter from a Bayesian Perspective and Revisiting the Goodness of Fit Tests with an Application to Model Groundwater Data from Dong Thap, Vietnam

Authors: Thi-Yen-Anh Pham, Dung T. Nguyen, Nabendu Pal

Comments: 30 pages, 5 figures

Subjects: Methodology (stat.ME); Applications (stat.AP)

This work has two major parts. First, we extend the recent study of Pham et al. (2025) on point estimation of the association parameter of a bivariate Frank copula. We investigate two Bayes estimators under the generalized flat prior and the Jeffreys prior, and compare them with the maximum likelihood estimator (MLE). Simulation results show that, for small sample sizes (n <= 25), the Bayes estimator under the Jeffreys prior uniformly outperforms both the generalized flat prior estimator and the MLE in terms of mean squared error (MSE). For moderate and large sample sizes, all estimators have very similar performances in terms of bias and MSE. We also discuss computational issues in the R package implementation that may significantly affect the computation of the MLE for very small samples.
In the second part, we apply the Frank copula to analyze the association between groundwater arsenic concentration and other hydrochemical variables using a recent dataset from Vietnam. We revisit the goodness-of-fit tests proposed by Genest et al. (2006), investigate several non-intuitive behaviors of the test statistics, and provide extensive simulated critical value tables. Our results complement and refine the computational findings reported in the earlier literature.
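
For readers unfamiliar with the model, the standard bivariate Frank copula CDF with association parameter $\theta \neq 0$ is $C_\theta(u,v) = -\theta^{-1}\log\{1 + (e^{-\theta u}-1)(e^{-\theta v}-1)/(e^{-\theta}-1)\}$. The sketch below only evaluates this textbook formula; the function name is hypothetical and it is not taken from the authors' R implementation or their estimators.

```python
import math

def frank_copula_cdf(u, v, theta):
    """Textbook Frank copula CDF C_theta(u, v) for u, v in [0, 1], theta != 0."""
    if theta == 0:
        raise ValueError("theta = 0 is the independence limit; use u * v instead")
    # expm1 and log1p keep the evaluation accurate when theta is close to zero.
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta

# Uniform margins: C(u, 1) = u for any admissible theta.
margin_check = frank_copula_cdf(0.3, 1.0, 5.0)
# Near theta = 0 the copula approaches independence, C(u, v) ~ u * v.
near_independence = frank_copula_cdf(0.3, 0.6, 1e-6)
```

Positive values of theta correspond to positive association and negative values to negative association, which is why point estimation of this single parameter is the focus of the entry above.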
[66] arXiv:2605.06521 [pdf, ps, other]: Title: Time-sensitive anytime-valid testing

Authors: Eugenio Clerico, Tobias Wegel, Iskander Azangulov, Patrick Rebeschini

Subjects: Statistics Theory (math.ST); Optimization and Control (math.OC)

Anytime-valid tests allow evidence to be checked during data collection: one can either continue testing or stop and reject the null while still controlling type-I error. Yet, in many applications rejection is useful only if it comes soon enough. We introduce a time-sensitive testing-by-betting framework that favours early rejection by assigning rewards to rejection times and maximising their expected value under a given alternative. This encompasses hard deadlines and softer time preferences. The resulting optimal control problem admits a Bellman representation in terms only of time and evidence against the null, rather than the full history. For hard deadlines, the simple-vs-simple case reduces to a finite-horizon Neyman--Pearson problem, and we identify the corresponding optimal e-process. Furthermore, we show that exponentially decaying rewards admit a stationary approximation, yielding the exponential-decay-optimal (EDO) criterion: a finite-time-scale counterpart to the classical growth-rate-optimal (GRO) viewpoint in anytime-valid statistics, with the GRO criterion recovered in the large-time-scale limit.
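
As background for the testing-by-betting setting, here is a minimal sketch of the classical likelihood-ratio bet for a simple-vs-simple Bernoulli test. It is the standard baseline, not the time-sensitive strategy of the entry above; the function names and the Bernoulli model are assumptions made for illustration. The bettor's wealth is a nonnegative martingale under the null, so rejecting when it first reaches 1/alpha keeps the type-I error at most alpha (Ville's inequality).

```python
def likelihood_ratio_wealth(xs, p0, p1):
    """Wealth path of a bettor testing H0: Bernoulli(p0) against Bernoulli(p1)."""
    wealth = 1.0
    path = []
    for x in xs:
        # Each observation multiplies the wealth by its likelihood ratio.
        lik1 = p1 if x == 1 else 1.0 - p1
        lik0 = p0 if x == 1 else 1.0 - p0
        wealth *= lik1 / lik0
        path.append(wealth)
    return path

def rejection_time(path, alpha):
    """First time the wealth reaches 1 / alpha, or None if it never does."""
    for t, w in enumerate(path, start=1):
        if w >= 1.0 / alpha:
            return t
    return None

# Ten successes in a row: the wealth grows by a factor 1.5 per observation.
tau_success = rejection_time(likelihood_ratio_wealth([1] * 10, p0=0.5, p1=0.75), alpha=0.05)
# Ten failures in a row: the wealth halves each time and the test never rejects.
tau_failure = rejection_time(likelihood_ratio_wealth([0] * 10, p0=0.5, p1=0.75), alpha=0.05)
```

A time-sensitive criterion would change how the bets are sized so that early rejection is rewarded; the stopping rule based on the wealth threshold stays the same.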
[67] arXiv:2605.06528 [pdf, ps, other]: Title: QUBO-Based Calibration for Regression Trees

Authors: Iro René Kouarfate, Maxime Dion, Anne MacKay, Mathieu Pigeon

Subjects: Computation (stat.CO)

Tree-based regression models are widely used in supervised learning, with the Classification and Regression Tree (CART) algorithm serving as a standard reference. CART construction involves solving a sequence of split-selection optimization problems. For categorical predictors, this problem can be formulated as a combinatorial fractional optimization problem. This structure makes the exact optimization computationally challenging and leads to standard implementations that rely on greedy heuristics, which may result in suboptimal splits. In this work, we reformulate this fractional problem and apply the Dinkelbach (1967) algorithm to convert it into a Quadratic Unconstrained Binary Optimization (QUBO) problem. Using state-of-the-art QUBO solvers, we obtain QUBO-based regression trees with predictive performance comparable to standard CART while yielding higher-quality split solutions. These results highlight the potential of QUBO formulations for improving tree-based learning methods and open perspectives for future hybrid classical--quantum implementations.
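
To make the Dinkelbach step concrete, here is a minimal sketch on a toy categorical split. It assumes centered responses, in which case the variance reduction of a split is proportional to the ratio (left response sum)^2 / (left size x right size). Both numerator and denominator are quadratic in the binary assignment x, so each parametric subproblem numer(x) - lam * denom(x) has QUBO form. The data, the function names, and the brute-force argmax standing in for a QUBO solver are all assumptions for illustration, not the authors' formulation.

```python
from itertools import product

# Toy data (hypothetical): per-category counts and centered response sums.
counts = [3, 2, 4, 1]
sums = [3.0, -1.0, -4.0, 2.0]  # adds up to zero because responses are centered
total = sum(counts)

def numer(x):
    # Squared response sum of the categories sent to the left child.
    return sum(s * xi for s, xi in zip(sums, x)) ** 2

def denom(x):
    # Product of the two child sizes; positive for every non-trivial split.
    left = sum(c * xi for c, xi in zip(counts, x))
    return left * (total - left)

# Keep non-trivial splits only, so that denom(x) > 0 on the feasible set.
candidates = [x for x in product([0, 1], repeat=len(counts))
              if 0 < sum(x) < len(counts)]

def dinkelbach(numer, denom, candidates, tol=1e-12, max_iter=100):
    """Maximize numer(x) / denom(x) over candidates, assuming denom(x) > 0."""
    x = candidates[0]
    lam = numer(x) / denom(x)
    for _ in range(max_iter):
        # Parametric subproblem; a QUBO solver would replace this enumeration.
        x = max(candidates, key=lambda z: numer(z) - lam * denom(z))
        if numer(x) - lam * denom(x) <= tol:
            break  # lam is the optimal ratio once the subproblem value hits zero
        lam = numer(x) / denom(x)
    return x, lam

best_x, best_ratio = dinkelbach(numer, denom, candidates)
```

The ratio lam increases at every iteration, so on a finite feasible set the loop stops after finitely many subproblem solves.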
[68] arXiv:2605.06564 [pdf, ps, other]: Title: Dynamic Treatment on Networks

Authors: Bengusu Nar, Jiguang Li, Veronika Ročková, Panos Toulis

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

In networks, effective dynamic treatment allocation requires deciding both whom to treat and when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static, while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a three-stage pipeline that (i) estimates network adoption dynamics via a Bayesian dynamic Ising model from a single observed panel, (ii) augments treatment adoption histories with continuous posterior latent states, and (iii) learns a dynamic policy via offline reinforcement learning. The Bayesian mechanism enables uncertainty quantification over dynamic decisions, yielding posterior ensemble policies with interpretable spillover estimates. We provide a finite-sample regret upper bound that decomposes into standard offline-RL uncertainty, network abstraction error, and first-stage error in Ising state estimation. We apply our method to data from Indian village microfinance networks and synthetic stochastic block models under simulated heterogeneous susceptible-infected-susceptible (SIS) dynamics and demonstrate that adaptive targeting outperforms static centrality benchmarks.
[69] arXiv:2605.06568 [pdf, ps, other]: Title: Statistical Significance Revisited

Authors: Reason Machete

Comments: 30 pages, 2 figures

Subjects: Other Statistics (stat.OT)

Since its introduction by Fisher, the method of hypothesis testing that relies on computing error probabilities has witnessed several developments. Perhaps the most significant development was the seminal contribution of Neyman and Pearson, who brought in the concept of the alternative hypothesis with its corresponding error of the second kind. Significance tests have played a major role in various scientific and technological developments, but not without controversies. Although originally cast as frequentist approaches, Bayesian ideas have been incorporated into significance tests, widening access to them. The quantities central to computations of error probabilities are the sampling distributions, which can be computed even without thresholds or alternative hypotheses. Even though Fisher used the significance threshold of 0.05 in his calculations, he cautioned against prescribing any specific threshold. Recently, there have been calls for reformation in practice with regard to the almost standard use of the significance threshold of 0.05, prepublication confirmatory studies, the dichotomous consideration of the null and alternative hypothesis and abandoning significance tests altogether in favour of other approaches such as confidence intervals and Bayesian decision theory. In this paper, we examine these calls for reform and unearth their strengths and shortcomings.
[70] arXiv:2605.06581 [pdf, ps, other]: Title: History-Aware Conformal Prediction Sets for Censored Time-to-Event Outcomes

Authors: Yuyao Wang, Alexander W. Levis, Shu Yang, Larry Han

Subjects: Methodology (stat.ME)

Existing conformal prediction methods for time-to-event outcomes leverage only baseline covariates, producing prediction intervals that are insufficiently informative to facilitate decision making. We propose History-Aware Prediction Sets (HAPS), a conformal framework that constructs prediction sets for individual event times using covariate histories observed up to a decision time, targeting coverage among individuals who have survived to this time. HAPS handles right censoring adjusted for time-varying confounders via inverse probability of censoring weighting. When the censoring weights are consistently estimated, it achieves PAAC (probably asymptotically approximately correct) coverage among survivors. We further propose two doubly robust extensions of HAPS to weaken reliance on consistent estimation of the censoring distribution. In simulations, HAPS and its extensions reduce median prediction interval length by up to 75% relative to baseline comparators while maintaining close to nominal coverage. On two public benchmark data sets, HAPS reduces the median interval length by up to 60% for predictions at year 5, compared to the baseline comparators.
[71] arXiv:2605.06590 [pdf, ps, other]: Title: Unbiased estimation in two-stage adaptive enrichment designs

Authors: Enyu Li, Nigel Stallard, Ekkehard Glimm, Peter K. Kimani

Subjects: Methodology (stat.ME)

Recent advances in biomedical research have identified an increasing number of biomarkers associated with heterogeneity in patient responses to medical treatments. When a treatment is suspected to benefit certain patient subpopulations, adaptive enrichment designs may be more efficient and ethical. In such designs, an interim analysis is incorporated during the trial to select patient subpopulations for which the experimental treatment appears promising, according to predefined subpopulation selection rules. However, data-dependent selection can induce selection bias, causing conventional maximum likelihood estimators (MLEs) to overestimate the treatment effect in the selected patient subgroup. Existing inference methods for addressing this bias are typically rule-specific, highlighting the need for an estimation framework that accommodates a broader class of subpopulation selection rules. In this work, we define a general class of subpopulation selection rules based on the sample space partition condition and provide a systematic derivation that yields a unified formula for the Uniformly Minimum Variance Conditional Unbiased Estimator (UMVCUE). This generality allows our formulation to encompass a wide spectrum of adaptive enrichment designs, eliminating the necessity for case-specific derivations for each new design. Extensive simulations confirm the unbiasedness of the proposed UMVCUE, ensuring that therapeutic benefits are not overestimated. By bridging the gap between flexible interim subpopulation selection and rigorous statistical inference, our framework has the potential to facilitate the implementation of diverse subpopulation selection rules with greater ease in real-world trials and promote more efficient and ethical drug development.
[72] arXiv:2605.06608 [pdf, ps, other]: Title: DARTS: Targeting Prognostic Covariates in Budget-Constrained Sequential Experiments

Authors: Kateryna Husar, Alexander Volfovsky

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget. We introduce Dynamic Adaptive Rerandomization via Thompson Sampling (DARTS), which treats covariate acquisition as a sequential optimization problem embedded within a design-based causal inference task. A budgeted combinatorial Thompson sampler learns which covariates are most prognostic across successive batches; selected covariates then drive rerandomization and regression adjustment to reduce batch-level average treatment effect variance. Our primary theoretical contribution is a decoupling result: adaptive covariate selection based on past batches preserves batch-level randomization validity, and the cumulative inverse-variance weighted estimator achieves at least nominal asymptotic coverage. We further derive a Bayes risk bound for the acquisition layer that matches the minimax lower bound up to logarithmic factors. Empirically, DARTS systematically concentrates the budget on informative features, significantly closing the efficiency gap to oracle designs while maintaining strict inferential validity.
[73] arXiv:2605.06655 [pdf, ps, other]: Title: Improving Variance Estimation for Covariate Adjustment with Binary Outcomes

Authors: Kaitlyn Lee, Alex Ocampo, Courtney Schiffman, Michael Friesenhahn, Christina Rabe, Michael Rosenblum

Subjects: Methodology (stat.ME)

Covariate adjustment is a general method for improving precision when estimating treatment effects in randomized trials and is recommended by the FDA in its 2023 guidance when baseline variables are prognostic for the primary outcome. We focus on a method highlighted in that guidance called "standardization" (or "g-computation") for estimating the marginal treatment effect. We address the question of how to reliably estimate variance for binary outcomes when marginal outcome probabilities are close to 0 or 1. We propose an influence function-based leave-one-out cross-validated (IF-LOO) variance estimator for the standardized difference-in-means average treatment effect. Through simulation studies, we show that this estimator provides appropriate type-I error control and performs reliably in challenging settings where existing methods can yield inflated type-I error or fail entirely, such as when outcome events are rare or sample sizes are small. In addition to having desirable statistical properties, we derive a closed-form expression for the proposed estimator, enabling straightforward and reliable implementation by study statisticians. The robust finite-sample performance and ease of implementation suggest the IF-LOO variance estimator is a prudent default choice for standardization in clinical trials.
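
A minimal sketch of the standardization (g-computation) point estimate may help fix ideas. It assumes a single binary baseline covariate and a saturated outcome model (cell means), whereas trials typically use a working regression model such as logistic regression. It does not implement the IF-LOO variance estimator, and all names and data are hypothetical.

```python
from collections import defaultdict

def standardized_risk_difference(arm, covariate, outcome):
    """G-computation estimate of the marginal risk difference (treated minus control)."""
    events = defaultdict(int)
    sizes = defaultdict(int)
    for a, x, y in zip(arm, covariate, outcome):
        events[(a, x)] += y
        sizes[(a, x)] += 1

    def predicted_risk(a, x):
        # Saturated outcome model: the observed event rate in the (arm, covariate) cell.
        # Assumes every cell is non-empty.
        return events[(a, x)] / sizes[(a, x)]

    n = len(outcome)
    # Predict for every participant under each arm, then average over the
    # pooled covariate distribution of the whole trial.
    risk_treated = sum(predicted_risk(1, x) for x in covariate) / n
    risk_control = sum(predicted_risk(0, x) for x in covariate) / n
    return risk_treated - risk_control

# Hypothetical trial with eight participants.
arm = [1, 1, 1, 1, 0, 0, 0, 0]
covariate = [0, 0, 1, 1, 0, 0, 1, 1]
outcome = [1, 0, 1, 1, 0, 0, 1, 0]
estimate = standardized_risk_difference(arm, covariate, outcome)
```

With rare events, some cell rates sit at or near 0 or 1, which is the regime where the entry above reports that variance estimation for this estimator becomes delicate.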

Cross-lists for Fri, 8 May 26

[74] arXiv:2605.05341 (cross-list from cs.LG) [pdf, ps, other]: Title: Feature Starvation as Geometric Instability in Sparse Autoencoders

Authors: Faris Chaudhry, Keisuke Yano, Anthea Monod

Comments: 26 pages, 3 figures, 5 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)

Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the $\ell_1$-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an $\ell_2$ structural term that enforces strong convexity and Lipschitz stability with adaptive $\ell_1$ reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.
[75] arXiv:2605.05480 (cross-list from cs.LG) [pdf, ps, other]: Title: GRALIS: A Unified Canonical Framework for Linear Attribution Methods via Riesz Representation

Authors: Raimondo Fanale

Comments: 25 pages, 6 tables, 2 figures. Theoretical framework with preliminary experimental validation on BreaKHis (1,187 images, DenseNet-121). Extended empirical comparison in preparation

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The main XAI attribution methods for deep neural networks -- GradCAM, SHAP, LIME, Integrated Gradients -- operate on separate theoretical foundations and are not formally comparable. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework establishing a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q,mu) admits a unique canonical representation (Q, w, Delta), proved necessary by the Riesz Representation Theorem. This class encompasses SHAP, IG, LIME and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees absent in any individual method: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/sqrt(m))+O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Mobius transform without circularity. GRALIS satisfies 13.5/14 axiomatic properties vs. 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions and optimal multi-scale aggregation simultaneously. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762+/-0.109 and sparsity index 0.39. Extended comparison with baseline XAI methods is planned for a companion paper.
[76] arXiv:2605.05511 (cross-list from cs.LG) [pdf, ps, other]: Title: Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients

Authors: Linus Aronsson, Morteza Haghir Chehreghani

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Active feature acquisition (AFA) considers prediction problems in which features are costly to obtain and the learner adaptively decides which feature values to acquire for each instance and when to stop and predict. AFA can be formulated as a partially observable Markov decision process (POMDP), which naturally admits a sequential decision-making perspective. In this paper, we present non-myopic pathwise policy gradients (NM-PPG), a new AFA method built around this formulation. We introduce a continuous relaxation of the acquisition process that enables pathwise gradients through the full acquisition trajectory, avoiding the high variance of standard score-function policy gradients while allowing end-to-end optimization of a non-myopic acquisition policy. To better align training with deployment, we further develop a straight-through rollout scheme that follows hard feature acquisitions in the forward pass while backpropagating through the corresponding soft relaxation in the backward pass. We stabilize optimization with entropy regularization and staged temperature sharpening. Experiments on both synthetic and real-world datasets demonstrate that NM-PPG yields superior performance relative to state-of-the-art AFA baselines.
[77] arXiv:2605.05520 (cross-list from cs.LG) [pdf, ps, other]: Title: Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

Authors: Badr Moufad, Albina Ilina, Hai Victor Habi, Salem Lahlou, Yazid Janati, Hagit Messer, Eric Moulines

Comments: Preprint

Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect line integration relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.
[78] arXiv:2605.05521 (cross-list from econ.TH) [pdf, ps, other]: Title: An Axiomatic Foundation for Decisions with Counterfactual Utility

Authors: Benedikt Koch, Kosuke Imai, Tomasz Strzalecki

Subjects: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT); Statistics Theory (math.ST)

Counterfactual utilities evaluate decisions not only by the realized outcome under a given decision, but also by the counterfactual outcomes that would arise under alternative decisions. By generalizing standard utility frameworks, they allow decision-makers to encode asymmetric criteria, such as avoiding harm and anticipating regret. Recent work, however, has raised fundamental concerns about the coherence and transitivity of counterfactual utilities. We address these concerns by extending the von Neumann-Morgenstern (vNM) framework to preferences defined on the extended space of all potential outcomes rather than realized outcomes alone. We show that expected counterfactual utility satisfies the vNM axioms on this extended domain, thereby admitting a coherent preference representation. We further examine how counterfactual preferences map onto the realized outcome space through menu-dependent and context-dependent projections. This axiomatic framework reconciles apparent inconsistencies highlighted by the Russian roulette example in the statistics literature and resolves the well-known Allais paradox from behavioral economics. We also derive an additional axiom required to reduce counterfactual utilities to standard utilities on the same potential outcome space, and establish an axiomatic foundation for additive counterfactual utilities, which satisfy a necessary and sufficient condition for point identification. Finally, we show that our results hold regardless of whether individual potential outcomes are deterministic or stochastic.
[79] arXiv:2605.05609 (cross-list from cs.LG) [pdf, ps, other]: Title: Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand

Authors: Jianyu Xu, Yu-Xiang Wang

Comments: 30 pages, 1 figure, 1 table

Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)

We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves $\tilde O(T^{2/3})$ optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-existing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distribution.
[80] arXiv:2605.05656 (cross-list from math.HO) [pdf, ps, other]: Title: Notes on Transversality and Statistical Degeneracies in Distributional Models

Authors: R. Labouriau

Comments: 30 pages

Subjects: History and Overview (math.HO); Methodology (stat.ME)

These notes provide a pedagogical introduction to the role of transversality theory in the analysis of statistical degeneracies within the framework of distributional statistical models. The classical question of when a statistical model is well-behaved - in the sense of being identifiable, having non-singular Fisher information, and admitting robust estimation - is reformulated as a question about the geometry of a kernel-induced feature map. Statistical pathologies correspond to geometric degeneracies of this map, and transversality theory provides a precise language for understanding when and why such degeneracies are non-generic.
The exposition is organised in three parts. Part I surveys the statistical phenomena that motivate the geometric treatment: representation failure, non-identifiability, moment indeterminacy, singular information, nuisance parameters, and the Behrens-Fisher problem. Part II develops the necessary geometric toolkit - smooth maps, Sard's theorem, transversality, jets, stratifications, and the parametric transversality theorem - at a level accessible to students with a background in analysis and linear algebra but no prior exposure to differential topology. Part III returns to the statistical problems of Part I and shows how each one admits a unified geometric interpretation as a transversality condition on the feature map.
These notes are a pedagogical companion to the research paper Labouriau (2026) "Transversality and Geometric Regularisation in Distributional Statistical Models" (arXiv:2605.04536 [math.ST]), expanding its arguments with motivating examples, geometric intuition, and exercises aimed at advanced Master's and PhD students with a background in mathematical statistics and measure theory. They are designed to support seminars or reading groups.
[81] arXiv:2605.05685 (cross-list from cs.LG) [pdf, ps, other]: Title: Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting

Authors: Naveen Mysore

Comments: 9 pages, 4 figures, 6 tables, plus appendix. Under review at NeurIPS 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Unlike MLPs, Kolmogorov-Arnold Networks (KANs) expose explicit learnable edge functions on every connection, enabling mechanistic explanation in time-series forecasting. This paper introduces Temporal Functional Circuits, a framework that transforms KAN edge functions from latent visualizations into faithful, temporally grounded explanations. Built on a gated residual KAN that decomposes forecasts into a linear base and a sparsely activated KAN correction, the framework (i) maps each edge to input lags via output-aware attribution, (ii) ranks edges by learned activation range, and (iii) validates faithfulness through edge-level interventions including zeroing and spline removal. Removing the learned B-spline component while retaining the base SiLU term degrades forecasts, providing evidence that the spline shape itself carries predictive value beyond the base activation. On four synthetic regimes of increasing complexity, the learned gate opens progressively wider as signal complexity grows. On regime-switching signals, gated KAN achieves 59% lower MSE than linear-only models. Across eight benchmarks, the gated architecture is competitive with linear, attention, and MLP alternatives, while providing interpretable edge functions that MLP-based corrections cannot offer.
[82] arXiv:2605.05705 (cross-list from math.NA) [pdf, ps, other]: Title: Convex-Geometric Error Bounds for Positive-Weight Kernel Quadrature

Authors: Satoshi Hayakawa

Comments: 22 pages

Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Kernel quadrature can exploit RKHS spectral structure and outperform Monte Carlo on smooth integrands, but optimized quadrature weights are generally signed and may be numerically unstable. We study whether spectral acceleration remains possible when the weights are constrained to be positive, i.e., simplex weights. In the exact-target fixed-pool setting, an evaluated i.i.d. candidate pool of size $N$ is already available and the task is to reweight it so as to approximate the kernel mean embedding. We show that this positive reweighting problem is governed not by the equal-weight empirical average, but by the random convex hull generated by the pool. Our main geometric result shows that the mean of a bounded $d$-dimensional random vector can be approximated by a convex combination of $N$ i.i.d. samples at accuracy $O(d/N)$ with high probability, sharper than equal-weight averaging in the fixed-dimensional regime. We transfer this $d$-dimensional convex-hull approximation to full RKHS worst-case error through an augmented Mercer-truncation argument. The resulting positive-weight KQ bounds consist of a spectral tail term and a finite-sample convex-hull term, yielding Monte-Carlo-beating rates in favorable spectral regimes, including near-$O(1/N)$ rates up to logarithmic factors under exponential spectral decay. We also provide a constructive Frank--Wolfe algorithm that operates directly on the pool atoms, maintains simplex weights, and admits an explicit optimization-error bound.
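As an illustration of the fixed-pool reweighting idea, the following is a minimal sketch of a Frank-Wolfe loop over pool atoms that keeps the weights on the simplex. It is not the authors' algorithm: the Gaussian kernel, the bandwidth `H`, the Uniform[0, 1] target (whose kernel mean has a closed form via `math.erf`), and the exact line search are all assumptions chosen to keep the example self-contained.

```python
import math
import random

H = 0.2  # Gaussian kernel bandwidth (illustrative assumption)

def kernel(x, y):
    return math.exp(-((x - y) ** 2) / (2 * H * H))

def mean_embedding(x):
    # Closed-form kernel mean of Uniform[0, 1] under the Gaussian kernel.
    s = H * math.sqrt(2.0)
    return H * math.sqrt(math.pi / 2.0) * (math.erf((1 - x) / s) + math.erf(x / s))

def objective(w, K, m):
    # w^T K w - 2 w^T m: squared RKHS error up to a constant independent of w.
    n = len(w)
    quad = sum(w[i] * K[i][j] * w[j] for i in range(n) for j in range(n))
    return quad - 2.0 * sum(w[i] * m[i] for i in range(n))

def frank_wolfe(K, m, steps=200):
    n = len(m)
    w = [1.0 / n] * n  # start from the equal-weight average
    Kw = [sum(K[i][j] * w[j] for j in range(n)) for i in range(n)]
    for _ in range(steps):
        # Linear minimisation over the simplex picks a single pool atom.
        i = min(range(n), key=lambda k: Kw[k] - m[k])
        wKw = sum(w[k] * Kw[k] for k in range(n))
        wm = sum(w[k] * m[k] for k in range(n))
        slope = 2.0 * ((Kw[i] - m[i]) - (wKw - wm))  # derivative along e_i - w
        curv = K[i][i] - 2.0 * Kw[i] + wKw           # (e_i - w)^T K (e_i - w)
        if slope >= 0.0 or curv <= 0.0:
            break
        gamma = min(1.0, -slope / (2.0 * curv))      # exact line search on [0, 1]
        w = [(1.0 - gamma) * wk for wk in w]
        w[i] += gamma
        Kw = [(1.0 - gamma) * Kw[k] + gamma * K[k][i] for k in range(n)]
    return w

random.seed(0)
pool = [random.random() for _ in range(40)]  # i.i.d. candidate pool
K = [[kernel(a, b) for b in pool] for a in pool]
m = [mean_embedding(a) for a in pool]
weights = frank_wolfe(K, m)
```

Because every update is a convex combination of the current weights and a vertex of the simplex, the weights stay nonnegative and sum to one. Starting from equal weights with an exact line search means the objective cannot end up above the equal-weight value.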
[83] arXiv:2605.05890 (cross-list from cs.LG) [pdf, ps, other]: Title: RepFlow: Representation Enhanced Flow Matching for Causal Effect Estimation

Authors: Yifei Xie, Jian Huang

Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

Estimating causal effects from observational data has become increasingly critical in diverse fields including healthcare, economics, and social policy. The fundamental challenge in causal inference arises from the missing counterfactuals and the selection bias. Existing methods are largely limited to point estimates and lack the capacity for distribution modeling. In this work, we propose RepFlow, a novel framework that formulates causal effect estimation as a joint optimization problem integrating representation learning with Conditional Flow Matching (CFM). RepFlow mitigates selection bias by minimizing the entropically regularized Wasserstein distance between treated and control representations. To enhance numerical stability, we further introduce an $L_2$ normalization constraint on latent representations. This balanced representation enables the flow model to accurately capture the distribution of potential outcomes. Extensive experiments across a wide range of benchmarks demonstrate that RepFlow consistently outperforms existing methods in both point and distributional causal effect estimation.
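For readers unfamiliar with the entropically regularized Wasserstein distance mentioned above, here is a minimal Sinkhorn sketch on $L_2$-normalized toy representations. It is not RepFlow's implementation: the squared Euclidean cost, uniform sample weights, the value of `eps`, and the toy 2-D points are assumptions, and the function returns only the transport cost of the entropic plan rather than a full training loss.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sinkhorn(xs, ys, eps=0.5, iters=200):
    """Entropic optimal transport between two uniform-weight point clouds."""
    n, m = len(xs), len(ys)
    a, b = [1.0 / n] * n, [1.0 / m] * m
    # Squared Euclidean cost; at most 4 for unit-norm inputs.
    C = [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scalings so that row and column marginals match a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(plan[i][j] * C[i][j] for i in range(n) for j in range(m))
    return cost, plan

# Hypothetical "treated" and "control" representations for illustration only.
treated = [l2_normalize(v) for v in [(1.0, 0.0), (1.0, 0.2), (1.0, -0.2)]]
near = [l2_normalize(v) for v in [(1.0, 0.1), (1.0, -0.1), (1.0, 0.3)]]
far = [l2_normalize(v) for v in [(-1.0, 0.0), (-1.0, 0.2), (-1.0, -0.2)]]
cost_near, plan_near = sinkhorn(treated, near)
cost_far, plan_far = sinkhorn(treated, far)
```

In this toy setup the cost is small when the two groups occupy the same region of the unit circle and large when they sit on opposite sides, which is the kind of imbalance signal a representation learner could minimise.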
[84] arXiv:2605.05967 (cross-list from cs.LG) [pdf, ps, other]: Title: Sharper Guarantees for Misspecified Kernelized Bandit Optimization

Authors: Davide Maran, Csaba Szepesvári

Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the kernel effective dimension, while in online regret bounds, the corresponding penalty is $\sqrt{\gamma_n}\,n\varepsilon$, where $\gamma_n$ is the maximum information gain after $n$ rounds of interaction. In this work, we show that, for a large class of kernels, the misspecification amplification can be reduced to logarithmic or polylogarithmic growth. In the offline setting, we first prove high-probability simple-regret bounds whose misspecification term is governed by a spectral Lebesgue constant. This yields logarithmic amplification for one-dimensional monotone spectra and polylogarithmic amplification for multivariate Fourier-diagonal product kernels. In the online setting, we modify a domain-splitting algorithm and prove a cumulative regret bound of $\widetilde{\mathcal O}(\sqrt{\gamma_n n}+n\varepsilon)$ under mild localized eigendecay assumptions, removing the extra $\sqrt{\gamma_n}$ factor from the misspecification term. The common principle is localization: spectral localization controls the Lebesgue constant of the offline approximation operator, while domain splitting implements the spatial analogue of this mechanism in the online setting, preventing local misspecification errors from being amplified globally.
[85] arXiv:2605.06004 (cross-list from cs.LG) [pdf, ps, other]: Title: A Fine-Grained Understanding of Uniform Convergence for Halfspaces

Authors: Aryeh Kontorovich, Kasper Green Larsen

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)

We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $\Theta(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{\tau\ln(1/\tau)}$ at true error $\tau$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error $O(1/n)$. In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a $\ln\ln n$ overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.
[86] arXiv:2605.06152 (cross-list from cs.LG) [pdf, ps, other]: Title: Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Authors: Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

Comments: 28 pages, 13 figures

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC); Machine Learning (stat.ML)

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
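The absorption effect described above can be reproduced in a few lines. The sketch below is an illustration rather than the paper's experiment: it emulates float32 by rounding after every arithmetic step (the helper `f32` and the logits `[20, 0, 0]` are assumptions), and compares the softmax cross-entropy gradient with respect to the logits in single and double precision.

```python
import math
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def ce_logit_grad(logits, label, rnd=lambda x: x):
    """Gradient of softmax cross-entropy w.r.t. the logits: p - onehot.

    `rnd` is applied after every arithmetic step to emulate a precision.
    """
    top = max(logits)
    exps = [rnd(math.exp(rnd(z - top))) for z in logits]
    total = 0.0
    for e in exps:
        total = rnd(total + e)  # tiny terms can be absorbed by the running sum
    probs = [rnd(e / total) for e in exps]
    return [rnd(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs)]

logits = [20.0, 0.0, 0.0]  # confident prediction for class 0
grad64 = ce_logit_grad(logits, 0)           # double precision reference
grad32 = ce_logit_grad(logits, 0, rnd=f32)  # emulated single precision
```

In the emulated float32 run the two small exponentials are absorbed when added to 1, so the correct-class probability is exactly 1 and its gradient is exactly 0, while the incorrect-class gradients stay positive. The per-class gradients therefore no longer sum to zero, whereas in double precision they cancel up to rounding error.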
[87] arXiv:2605.06202 (cross-list from cs.LG) [pdf, ps, other]: Title: Bandit Learning in General Open Multi-agent Systems

Authors: Mengfan Xu

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the pre-training degree of new agents quantifies how much information an agent carries upon entry, stability measures the impact of new agents on the system, and global dynamic regret compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.
[88] arXiv:2605.06295 (cross-list from cs.LG) [pdf, ps, other]: Title: Attributions All the Way Down? The Metagame of Interpretability

Authors: Hubert Baniecki, Przemyslaw Biecek, Fabian Fumagalli

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
[89] arXiv:2605.06333 (cross-list from cs.CV) [pdf, ps, other]: Title: TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices

Authors: Shouvik Sardar, Sourish Das

Comments: 14 Pages, 1 Figure, 4 Tables

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and, for classification, the Jacobi prior, a Bayesian method that provides closed-form, non-iterative estimators via projection. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We also prove the asymptotic equivalence, consistency, asymptotic normality, and bias correction of Jacobi-DMR. All data and code are available at: https://github.com/shouvik-sardar/TinyBayes
[90] arXiv:2605.06349 (cross-list from math.NA) [pdf, ps, other]: Title: Low-rank kernel methods for American option pricing

Authors: Michael Multerer, Paul Schneider, Chiara Segala

Subjects: Numerical Analysis (math.NA); Statistics Theory (math.ST)

We propose a scalable and theoretically grounded low-rank conditional expectation model for recursive Monte Carlo optimal stopping problems, in particular American option pricing. Our method reformulates the estimation of continuation values as a learning problem in a reproducing kernel Hilbert space, in which the conditional expectation is represented as a linear operator acting on future payoffs. This perspective yields an offline-online decomposition: the operator is learned once from simulated data and subsequently reused across all exercise dates, eliminating the need to recompute regression models at each step of the backward recursion. We establish convergence guarantees and derive bounds quantifying the approximation errors across exercise dates. Numerical experiments demonstrate the speed and accuracy of the proposed approach relative to extant methods.
[91] arXiv:2605.06352 (cross-list from cs.LG) [pdf, ps, other]: Title: Topological Signatures of Grokking

Authors: Yifan Tang, Qiquan Wang, Inés García-Redondo, Anthea Monod

Comments: 19 pages, 14 figures, 2 tables

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.
[92] arXiv:2605.06355 (cross-list from cs.LG) [pdf, ps, other]: Title: Order-Agnostic Autoregressive Modelling with Missing Data

Authors: Ignacio Peis, Pablo M. Olmos, Jes Frellsen

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second, we introduce the first principled framework for training them directly on incomplete datasets under general missingness mechanisms. Third, we leverage their amortized conditional density estimation to perform active information acquisition, i.e., sequentially selecting the most informative missing variables for downstream prediction or inference. Across a suite of real-world benchmarks, our Missingness-Aware Order-Agnostic Autoregressive Model (MO-ARM) consistently outperforms established imputation baselines.
[93] arXiv:2605.06375 (cross-list from cs.LG) [pdf, ps, other]: Title: A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

Authors: Hao Yu

Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)

Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of standard GRPO's gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants, including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF, UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
[94] arXiv:2605.06386 (cross-list from econ.EM) [pdf, ps, other]: Title: Covariate Balancing and Riesz Regression Should Be Guided by the Neyman Orthogonal Score in Debiased Machine Learning

Authors: Masahiro Kato

Subjects: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)

This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can be represented by functions of covariates alone, and it is the natural finite-dimensional approximation for targets such as ATT counterfactual means. For ATE estimation under treatment effect heterogeneity, however, the score error generally contains treatment-specific components because the outcome regression is a function of the full regressor $X=(D,Z)$. In that case, balancing common functions of $Z$ can leave the treatment-specific component unbalanced. We therefore advocate regressor balancing, implemented by Riesz regression with basis functions of $X$, as the general balancing principle for DML. The position is not that covariate balancing is invalid, but that covariate balancing should be understood as the special case that is appropriate when the score-relevant regression error is a function of covariates alone.
[95] arXiv:2605.06474 (cross-list from cs.LG) [pdf, ps, other]: Title: Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching

Authors: Xiang Li, Nan Jiang

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
[96] arXiv:2605.06520 (cross-list from cs.GT) [pdf, ps, other]: Title: Optimizing Social Utility in Sequential Experiments

Authors: Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez

Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Methodology (stat.ME)

Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of 'moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent's optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than 35% relative to standard, non-sequential protocols.
[97] arXiv:2605.06541 (cross-list from cs.LG) [pdf, ps, other]: Title: Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation

Authors: Yutong Wang, Yannig Goude, Qiwei Yao

Comments: Preprint

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponentially weighted least-squares (EWLS) adaptation experts at multiple forgetting factors, and aggregates raw and EWLS-adapted forecasts with MLpol, a parameter-free online aggregation rule. Under boundedness conditions, we establish deterministic oracle inequalities showing that it competes with both the best raw predictor and the best bounded, time-varying affine combinations of the base predictions, up to a path-length-dependent tracking cost and a sublinear aggregation overhead. We evaluate MELO on French national electricity-load forecasting through the COVID-19 lockdown using no regime indicators, lockdown dates, or policy covariates. MELO reduces overall RMSE by 34.7% relative to base-only MLpol and achieves lower overall RMSE than a TabICL reference supplied with an external COVID policy-response covariate. Moreover, MELO requires only lightweight per-step recursive updates without model retraining.
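To make the idea of EWLS adaptation experts at several forgetting factors concrete, here is a minimal sketch using the textbook recursive least-squares update with a forgetting factor. It is not the MELO implementation: the class `EWLSExpert`, the affine map `y ~ a + b * f` of a single base forecast, the synthetic regime shift, and the chosen forgetting factors are assumptions, and the MLpol aggregation step is omitted entirely.

```python
import math

class EWLSExpert:
    """Recursive least squares with forgetting factor `lam`, fitting y ~ a + b * f."""

    def __init__(self, lam, delta=100.0):
        self.lam = lam
        self.theta = [0.0, 1.0]  # start at the identity map (raw forecast)
        self.P = [[delta, 0.0], [0.0, delta]]

    def predict(self, f):
        return self.theta[0] + self.theta[1] * f

    def update(self, f, y):
        x = [1.0, f]
        Px = [self.P[0][0] * x[0] + self.P[0][1] * x[1],
              self.P[1][0] * x[0] + self.P[1][1] * x[1]]
        denom = self.lam + x[0] * Px[0] + x[1] * Px[1]
        gain = [Px[0] / denom, Px[1] / denom]
        err = y - self.predict(f)
        self.theta = [self.theta[0] + gain[0] * err, self.theta[1] + gain[1] * err]
        # P <- (P - gain * x^T P) / lam; older data is discounted by lam each step.
        self.P = [[(self.P[r][c] - gain[r] * Px[c]) / self.lam for c in range(2)]
                  for r in range(2)]

experts = {lam: EWLSExpert(lam) for lam in (0.9, 0.999)}
sq_err = {lam: 0.0 for lam in experts}
raw_sq_err = 0.0
for t in range(200):
    f = 10.0 + 3.0 * math.sin(t / 5.0)  # synthetic base forecast
    y = f if t < 100 else 0.7 * f       # synthetic regime shift at t = 100
    for lam, expert in experts.items():
        pred = expert.predict(f)         # predict before the outcome is revealed
        if t >= 100:
            sq_err[lam] += (pred - y) ** 2
        expert.update(f, y)
    if t >= 100:
        raw_sq_err += (f - y) ** 2
```

A smaller forgetting factor gives a shorter memory and faster adaptation after the shift, at the price of noisier estimates in quiet regimes. Since the right memory is unknown in advance, running several such experts in parallel and aggregating them is the hedging idea the abstract describes.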
[98] arXiv:2605.06604 (cross-list from q-fin.CP) [pdf, ps, other]: Title: A Geometry-Aware Residual Correction of Hagan's SABR Implied Volatility Formula

Authors: Adil Reghai, Lama Tarsissi, Gérard Biau, Alex Lipton

Comments: 33 pages, 17 figures

Subjects: Computational Finance (q-fin.CP); Machine Learning (stat.ML)

This paper proposes a hybrid methodology to improve the approximation of SABR (Stochastic Alpha Beta Rho) implied volatility by combining analytical structure with machine learning. The approach augments the neural-network input representation with geometric features derived from the stochastic differential equations of the SABR model. Unlike approaches that fully replace analytical formulas with black-box models, the proposed framework preserves the analytical backbone of the model. The hybridization operates along two complementary dimensions. First, geometry-aware variables reflecting intrinsic properties of the SABR dynamics are used as structured inputs to the network. Second, the neural network is trained to learn the residual error relative to Hagan's closed-form approximation rather than implied volatility directly. The resulting model acts as a structured residual correction to the analytical formula, retaining interpretability while capturing higher-order effects that are not included in the asymptotic expansion. Numerical experiments conducted over realistic parameter domains, as well as stressed environments, show that the method improves accuracy and robustness compared with both analytical approximations and standard neural-network approaches. Because the correction remains lightweight and structurally consistent with the underlying model, the framework is well suited for real-time pricing and calibration in practical trading environments.
[99] arXiv:2605.06609 (cross-list from cs.LG) [pdf, ps, other]: Title: Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

Authors: Chenyang Zhang, Yuan Cao

Comments: 94 pages, 8 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of the ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.
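The algorithm that each layer is said to emulate, normalized gradient descent on a logistic loss, can be sketched directly. The code below shows only the plain optimisation procedure, not the transformer construction; the toy data, step size `lr`, and number of steps are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalized_gd_logistic(X, y, steps=50, lr=0.5):
    """Minimise mean log(1 + exp(-y * <w, x>)) with unit-length gradient steps."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, label in zip(X, y):  # labels are in {-1, +1}
            coef = -label / (1.0 + math.exp(label * dot(w, x)))
            for k in range(d):
                grad[k] += coef * x[k] / n
        norm = math.sqrt(dot(grad, grad))
        if norm == 0.0:
            break
        # Normalised step: only the gradient direction matters, not its size.
        w = [wk - lr * gk / norm for wk, gk in zip(w, grad)]
    return w

# Toy linearly separable "context" examples (hypothetical data).
X = [(2.0, 1.0), (1.0, 2.0), (3.0, 3.0), (-2.0, -1.0), (-1.0, -2.0), (-3.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w = normalized_gd_logistic(X, y)
```

Each iteration moves the weights by exactly `lr` in Euclidean norm regardless of how small the gradient has become, which is what distinguishes the normalized variant from ordinary gradient descent on separable data, where gradients shrink as margins grow.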
[100] arXiv:2605.06611 (cross-list from cs.LG) [pdf, ps, other]: Title: The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Authors: Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

Comments: Accepted to ICML 2026

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
[101] arXiv:2605.06612 (cross-list from cs.LG) [pdf, ps, other]: Title: Online Bayesian Calibration under Gradual and Abrupt System Changes

Authors: Yang Xu, Chiwoo Park

Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Machine Learning (stat.ML)

Bayesian model calibration is central to digital twins and computer experiments, as it aligns model outputs with field observations by estimating calibration parameters and correcting systematic model bias. Classical Bayesian calibration introduces latent parameters and a discrepancy function to model bias, but suffers from parameter-discrepancy confounding and is typically formulated as an offline procedure under a stationary data-generating assumption. These limitations are restrictive in modern digital twin applications, where systems evolve over time and may exhibit gradual drift and abrupt regime shifts. While data assimilation methods enable sequential updates, they generally do not explicitly model systematic bias and are less effective under abrupt changes. We propose Bayesian Recursive Projected Calibration (BRPC), an online Bayesian calibration framework for streaming data under simulator mismatch and nonstationarity. BRPC extends projected calibration to the online setting by separating a discrepancy-free particle update for calibration parameters from a conditional Gaussian process update for discrepancy, preserving identifiability while enabling bias-aware adaptation under gradual system evolution. To handle abrupt changes, BRPC is integrated with restart mechanisms that detect regime shifts and reset the calibration process. We establish theoretical guarantees for both components, including tracking performance under gradual evolution and false-alarm and detection behavior for restart mechanisms. Empirical studies on synthetic and plant-simulation benchmarks show that BRPC improves calibration accuracy under gradual changes, while restart-augmented BRPC further improves robustness and predictive performance under abrupt regime shifts compared to sliding-window Bayesian calibration and data assimilation baselines.

Replacements for Fri, 8 May 26

[102] arXiv:2304.11200 (replaced) [pdf, ps, other]: Title: A Plug-and-Play Method with Inpainting Network for Bayesian Uncertainty Quantification in Imaging

Authors: Xiaoyu Wang, Michael Tang, Audrey Repetti

Subjects: Methodology (stat.ME)
[103] arXiv:2306.01749 (replaced) [pdf, ps, other]: Title: Detecting Consumers' Financial Vulnerability using Open Banking Data: Evidence from UK Payday Loans

Authors: Victor Medina-Olivares, Raffaella Calabrese

Subjects: Applications (stat.AP); General Finance (q-fin.GN)
[104] arXiv:2410.20885 (replaced) [pdf, ps, other]: Title: A Distributed Lag Approach to the Generalised Dynamic Factor Model

Authors: Philipp Gersing

Subjects: Econometrics (econ.EM); Methodology (stat.ME)
[105] arXiv:2411.16666 (replaced) [src]: Title: CatNet: Controlling the False Discovery Rate in LSTM with SHAP Feature Importance and Gaussian Mirrors

Authors: Jiaan Han, Junxiao Chen, Yanzhe Fu

Comments: Withdrawn by the authors. The main theoretical result relies on an assumption that is not valid as stated. A substantially revised and corrected work will be posted separately

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
[106] arXiv:2412.00658 (replaced) [pdf, ps, other]: Title: Probabilistic Predictions of Option Prices with Modular Approximate Bayesian Inference

Authors: Worapree Maneesoonthorn, David T. Frazier, Gael M. Martin

Subjects: Statistical Finance (q-fin.ST); Computation (stat.CO); Methodology (stat.ME)
[107] arXiv:2412.02783 (replaced) [pdf, ps, other]: Title: Monotone representation and measurability of generalized $ψ$-estimators

Authors: Matyas Barczy, Zsolt Páles

Comments: 21 pages

Subjects: Statistics Theory (math.ST)
[108] arXiv:2504.11978 (replaced) [pdf, ps, other]: Title: On the Intersection and Composition properties of conditional independence

Authors: Tobias Boege

Comments: 21 pages; v3: minor revision and clarifications

Subjects: Information Theory (cs.IT); Statistics Theory (math.ST)
[109] arXiv:2504.16230 (replaced) [pdf, ps, other]: Title: Robust Causal Inference for EHR-based Studies of Point Exposures with Missingness in Eligibility Criteria

Authors: Luke Benz, Rajarshi Mukherjee, Rui Wang, David Arterburn, Heidi Fischer, Catherine Lee, Susan M. Shortreed, Sebastien Haneuse, Alexander W. Levis

Subjects: Methodology (stat.ME); Applications (stat.AP)
[110] arXiv:2505.08125 (replaced) [pdf, ps, other]: Title: Sharp Gaussian approximations for Decentralized Federated Learning

Authors: Soham Bonnerjee, Sayar Karmakar, Wei Biao Wu

Comments: Accepted as Spotlight, NeurIPS'25, Main Conference Track

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
[111] arXiv:2505.15064 (replaced) [pdf, ps, other]: Title: Why and When Deep is Better than Shallow: Implementation-Agnostic State-Transition Model of Deep Learning

Authors: Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda

Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
[112] arXiv:2505.18879 (replaced) [pdf, ps, other]: Title: Efficient Online Random Sampling via Randomness Recycling

Authors: Thomas L. Draper, Feras A. Saad

Journal-ref: Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2473-2511. Society for Industrial and Applied Mathematics, 2026

Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Information Theory (cs.IT); Probability (math.PR); Computation (stat.CO)
[113] arXiv:2507.00480 (replaced) [pdf, ps, other]: Title: Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization

Authors: Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, Jinkyoo Park

Comments: 25 pages, 14 figures, 6 tables. Equal contribution by Kiyoung Om, Kyuil Sim, and Taeyoung Yun

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[114] arXiv:2507.12457 (replaced) [pdf, ps, other]: Title: Asymptotic Theory of $K$-fold Cross-validation in Lasso and the validity of Bootstrap

Authors: Mayukh Choudhury, Debraj Das

Subjects: Methodology (stat.ME)
[115] arXiv:2507.20941 (replaced) [pdf, ps, other]: Title: Multivariate Standardized Residuals for Conformal Prediction

Authors: Sacha Braun, Eugène Berta, Michael I. Jordan, Francis Bach

Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)
[116] arXiv:2507.22004 (replaced) [pdf, ps, other]: Title: Horseshoe Forests for High-Dimensional Causal Survival Analysis

Authors: Tijn Jacobs, Wessel N. van Wieringen, Stéphanie L. van der Pas

Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
[117] arXiv:2508.17090 (replaced) [pdf, ps, other]: Title: Neural Stochastic Differential Equations on Compact State Spaces: Theory, Methods, and Application to Suicide Risk Modeling

Authors: Malinda Lu, Yue-Jane Liu, Matthew K. Nock, Yaniv Yacoby

Comments: Accepted at the Symposium on Probabilistic Machine Learning (ProbML) 2026, and at the Methods and Opportunities at Small Scale (MOSS), ICML 2025, Vancouver, Canada

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[118] arXiv:2509.14225 (replaced) [pdf, ps, other]: Title: Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Authors: Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo

Comments: 11 pages, 4 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[119] arXiv:2509.24814 (replaced) [pdf, ps, other]: Title: A Greedy PDE Router for Blending Neural Operators and Classical Methods

Authors: Sahana Rayan, Yash Patel, Ambuj Tewari

Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
[120] arXiv:2510.03949 (replaced) [pdf, ps, other]: Title: Analysis of kinetic Langevin Monte Carlo under the stochastic exponential Euler discretization from underdamped all the way to overdamped

Authors: Kyurae Kim, Samuel Gruffaz, Ji Won Park, Alain Oliviero Durmus

Comments: v3: fixed typos

Subjects: Computation (stat.CO); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
[121] arXiv:2510.08539 (replaced) [pdf, ps, other]: Title: On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Authors: Joe Suk, Yaqi Duan

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
[122] arXiv:2510.18120 (replaced) [pdf, ps, other]: Title: Generalization Below the Edge of Stability: The Role of Data Geometry

Authors: Tongtong Liang, Alexander Cloninger, Rahul Parhi, Yu-Xiang Wang

Comments: Accepted by ICLR 2026

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[123] arXiv:2510.23254 (replaced) [pdf, ps, other]: Title: Optimal In-context Adaptivity and Distributional Robustness of Transformers

Authors: Tianyi Ma, Tengyao Wang, Richard J. Samworth

Comments: 47 pages, 4 figures

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
[124] arXiv:2511.03236 (replaced) [pdf, ps, other]: Title: Unbiased Regression-Adjusted Estimation of Average Treatment Effects in Randomized Controlled Trials

Authors: Alberto Abadie, Mehrdad Ghadiri, Ali Jadbabaie, Mahyar JafariNodeh

Subjects: Econometrics (econ.EM); Methodology (stat.ME)
[125] arXiv:2511.13664 (replaced) [pdf, ps, other]: Title: Rate-optimal and computationally efficient nonparametric estimation on the circle and the sphere

Authors: Athanasios G. Georgiadis, Andrew P. Percival

Subjects: Statistics Theory (math.ST); Applications (stat.AP)
[126] arXiv:2511.17292 (replaced) [pdf, ps, other]: Title: Balancing Evidentiary Value and Sample Size of Adaptive Designs with Application to Animal Experiments

Authors: Leonhard Held, Fadoua Balabdaoui, Saverio Fontana, Samuel Pawel

Comments: Main paper: 35 pages, 4 figures, 3 tables. Supplementary Material: 17 pages, 5 figures, 1 table

Subjects: Methodology (stat.ME)
[127] arXiv:2512.06370 (replaced) [pdf, ps, other]: Title: Greedy Alignment Principle for Optimizer Selection

Authors: Jaerin Lee, Kyoung Mu Lee

Comments: 34 pages, 4 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[128] arXiv:2512.09538 (replaced) [pdf, ps, other]: Title: Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Authors: Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov

Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
[129] arXiv:2601.01351 (replaced) [pdf, ps, other]: Title: Errors-in-variables regression for dependent data with estimated error covariance matrix: To prewhiten or not?

Authors: Jingkun Qiu, Hanyue Chen, Song Xi Chen

Subjects: Applications (stat.AP)
[130] arXiv:2601.04378 (replaced) [pdf, ps, other]: Title: Aligned explanations in neural networks

Authors: Corentin Lobet, Francesca Chiaromonte

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[131] arXiv:2601.20571 (replaced) [pdf, ps, other]: Title: Fast and Efficient Gossip Algorithms for Robust and Non-smooth Decentralized Learning

Authors: Anna van Elst, Igor Colin, Stephan Clémençon

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[132] arXiv:2601.21831 (replaced) [pdf, ps, other]: Title: Generative Modeling of Discrete Data Using Geometric Latent Subspaces

Authors: Daniel Gonzalez-Alvarado, Jonas Cassel, Stefania Petra, Christoph Schnörr

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[133] arXiv:2602.01505 (replaced) [src]: Title: Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Authors: Navdeep Kumar, Tehila Dahan, Lior Cohen, Ananyabrata Barua, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

Comments: Following further internal verification, we identified foundational issues in the analytical framework, including unresolved problems in the treatment of nonstationary sampling and parts of the coupled convergence analysis under the stated assumptions. Addressing these issues requires a substantial overhaul of the theoretical framework beyond a standard revision

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[134] arXiv:2602.03258 (replaced) [pdf, ps, other]: Title: Principled Federated Random Forests for Heterogeneous Data

Authors: Rémi Khellaf, Erwan Scornet, Aurélien Bellet, Julie Josse

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[135] arXiv:2602.07618 (replaced) [pdf, ps, other]: Title: Dense Neural Networks are not Universal Approximators

Authors: Levi Rauchwerger, Stefanie Jegelka, Ron Levie

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[136] arXiv:2602.07633 (replaced) [pdf, ps, other]: Title: Flow-Based Conformal Predictive Distributions

Authors: Trevor Harris

Comments: 9 pages, 15 figures, 20 appendix pages

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[137] arXiv:2602.08318 (replaced) [pdf, ps, other]: Title: Is Flow Matching Just Trajectory Replay for Sequential Data?

Authors: Soon Hoe Lim, Shizheng Lin, Michael W. Mahoney, N. Benjamin Erichson

Comments: 56 pages

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
[138] arXiv:2602.08769 (replaced) [pdf, ps, other]: Title: The Unseen Species Problem Revisited

Authors: Edward Eriksson

Subjects: Statistics Theory (math.ST)
[139] arXiv:2602.17683 (replaced) [pdf, ps, other]: Title: Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates

Authors: Irene Iele, Giulia Romoli, Daniele Molino, Elena Mulero Ayllón, Filippo Ruffini, Paolo Soda, Matteo Tortora

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
[140] arXiv:2603.01192 (replaced) [pdf, ps, other]: Title: A Basin-Selection Perspective on Grokking via Singular Learning Theory

Authors: Ben Cullen, Sergio Estan-Ruiz, Riya Danait, Jiayi Li

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[141] arXiv:2603.04673 (replaced) [pdf, ps, other]: Title: sFRC for assessing hallucinations in medical image restoration

Authors: Prabhat Kc, Rongping Zeng, Nirmal Soni, Aldo Badano

Comments: 16 pages; 14 figures; 1 Supplemental document. TechRxiv Preprints, 2025

Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph); Machine Learning (stat.ML)
[142] arXiv:2603.04807 (replaced) [pdf, ps, other]: Title: Does Sparse Connectivity Improve Generalization? Convolutional Networks Below the Edge of Stability

Authors: Tongtong Liang, Esha Singh, Rahul Parhi, Alexander Cloninger, Yu-Xiang Wang

Comments: Under Review. Comments welcome!

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[143] arXiv:2603.11161 (replaced) [pdf, ps, other]: Title: Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

Authors: Orit Davidovich, Zohar Ringel

Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
[144] arXiv:2603.13085 (replaced) [pdf, ps, other]: Title: Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width

Authors: Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez

Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Machine Learning (stat.ML)
[145] arXiv:2603.13441 (replaced) [pdf, ps, other]: Title: Filtered Spectral Projection for Quantum Principal Component Analysis

Authors: Sk Mujaffar Hossain, Satadeep Bhattacharjee

Subjects: Machine Learning (stat.ML); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
[146] arXiv:2603.23055 (replaced) [pdf, ps, other]: Title: Post-Selection Distributional Model Evaluation

Authors: Amirmohammad Farzaneh, Osvaldo Simeone

Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
[147] arXiv:2603.27389 (replaced) [pdf, ps, other]: Title: Prediction-Based Markov Violation Scores for Detecting Non-Markovian Observations in Reinforcement Learning

Authors: Naveen Mysore

Comments: Accepted at RLC 2026, to appear in Reinforcement Learning Journal

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[148] arXiv:2604.05241 (replaced) [pdf, ps, other]: Title: Information Geometry and Asymptotic Theory for SMML Estimators

Authors: Enes Makalic, Daniel F. Schmidt

Subjects: Statistics Theory (math.ST)
[149] arXiv:2604.07096 (replaced) [pdf, ps, other]: Title: Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

Authors: Changkun Guan, Mengfan Xu

Comments: 21 pages

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[150] arXiv:2604.11890 (replaced) [pdf, ps, other]: Title: Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

Authors: Sergey Alekseev

Comments: Minor text edits; 10 pages of main text; 34 pages total; 5 figures in the main text, 25 figures total; preprint

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[151] arXiv:2604.20568 (replaced) [pdf, ps, other]: Title: Amortized Vine Copulas for High-Dimensional Density and Information Estimation

Authors: Houman Safaai

Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Methodology (stat.ME)
[152] arXiv:2604.25565 (replaced) [pdf, ps, other]: Title: CBARA: Covariate-Balanced-and-Adjusted Response-Adaptive Randomization

Authors: Hengjia Fang, Wei Ma

Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
[153] arXiv:2604.27307 (replaced) [pdf, ps, other]: Title: A Novel Computational Framework for Causal Inference: Tree-Based Discretization with ILP-Based Matching

Authors: Tianyu Yang, Md. Noor-E-Alam

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[154] arXiv:2605.00742 (replaced) [pdf, ps, other]: Title: Position: agentic AI orchestration should be Bayes-consistent

Authors: Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison, Gintare Karolina Dziugaite, Maurizio Filippone, Andrew Y. K. Foong, Vincent Fortuin, Dimitris Fouskakis, Jes Frellsen, Eyke Hüllermeier, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Nikita Kotelevskii, Salem Lahlou, Yingzhen Li, Fang Liu, Clare Lyle, Thomas Möllenhoff, Konstantina Palla, Maxim Panov, Yusuf Sale, Kajetan Schweighofer, Artem Shelmanov, Siddharth Swaroop, Martin Trapp, Willem Waegeman, Andrew Gordon Wilson, Alexey Zaytsev

Comments: Accepted for publication at ICML 2026

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
[155] arXiv:2605.01193 (replaced) [pdf, ps, other]: Title: A Novel Exact Inference Approach for Log-Logistic Reliability Functions with Applications to Time-to-Event Data

Authors: Bowen Liu, Malwane M.A. Ananda, Sam Weerahandi

Comments: 12 pages, 4 figures, 7 tables

Subjects: Methodology (stat.ME); Applications (stat.AP)
[156] arXiv:2605.01669 (replaced) [pdf, ps, other]: Title: PRCD-MAP: Learning How Much to Trust Imperfect Priors in Causal Discovery

Authors: Xihang Shan, Da Zhou

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
[157] arXiv:2605.03061 (replaced) [pdf, ps, other]: Title: Dynamic Vine Copulas: Detecting and Quantifying Time-Varying Higher-Order Interactions

Authors: Houman Safaai, Alessandro Marin Vargas

Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
[158] arXiv:2605.03063 (replaced) [pdf, ps, other]: Title: From Information Geometry to Jet Substructure: A Triality of Cumulant Tensors, Energy Correlators, and Hypergraphs

Authors: Aritra Bal, Markus Klute, Benedikt Maier, Michael Spannowsky

Comments: 31 pages, 8 figures, 3 tables

Subjects: High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
[159] arXiv:2605.03178 (replaced) [pdf, ps, other]: Title: Structure Learning for Directed Trees with Zero-Inflated Compositional Nodes

Authors: Shuangjie Zhang, Bani K. Mallick, Yang Ni

Comments: 29 pages, 2 figures

Subjects: Methodology (stat.ME)
[160] arXiv:2605.03222 (replaced) [pdf, ps, other]: Title: Beyond Activation Alignment: The Geometry of Neural Sensitivity

Authors: Amirhossein Yavari, Farnaz Zamani Esfahlani

Comments: 9 pages, 4 figures

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[161] arXiv:2605.04457 (replaced) [pdf, ps, other]: Title: Penalized KLIC Model Selection for the Generalized Method of Moments in Longitudinal Data with Time-Dependent Covariates

Authors: Mahmud Hasan, Mathias Nthiani Muia, Mous-Abou Hamadou, Niloofar Ramezani

Comments: 31 pages, 1 figure

Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
[162] arXiv:2605.04961 (replaced) [pdf, ps, other]: Title: Efficient GMM and Weighting Matrix under Misspecification

Authors: Byunghoon Kang

Subjects: Econometrics (econ.EM); Methodology (stat.ME)
[163] arXiv:2605.05073 (replaced) [pdf, ps, other]: Title: Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence

Authors: Shibo Yu, Yingzhou Wang, Yan Chen, Guodong Li, Jin-Hong Du

Subjects: Methodology (stat.ME)
[164] arXiv:2605.05102 (replaced) [pdf, ps, other]: Title: Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Authors: Harin Lee, Min-hwan Oh

Comments: Accepted at the Conference on Learning Theory (COLT) 2026

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
