Statistics
New submissions
New submissions for Wed, 4 Feb 26
- [1] arXiv:2602.02577 [pdf, ps, other]
Title: Relaxed Triangle Inequality for Kullback-Leibler Divergence Between Multivariate Gaussian Distributions
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
The Kullback-Leibler (KL) divergence is not a proper distance metric and does not satisfy the triangle inequality, posing theoretical challenges in certain practical applications. Existing work has demonstrated that KL divergence between multivariate Gaussian distributions follows a relaxed triangle inequality. Given any three multivariate Gaussian distributions $\mathcal{N}_1, \mathcal{N}_2$, and $\mathcal{N}_3$, if $KL(\mathcal{N}_1, \mathcal{N}_2)\leq \epsilon_1$ and $KL(\mathcal{N}_2, \mathcal{N}_3)\leq \epsilon_2$, then $KL(\mathcal{N}_1, \mathcal{N}_3)< 3\epsilon_1+3\epsilon_2+2\sqrt{\epsilon_1\epsilon_2}+o(\epsilon_1)+o(\epsilon_2)$. However, the supremum of $KL(\mathcal{N}_1, \mathcal{N}_3)$ is still unknown. In this paper, we investigate the relaxed triangle inequality for the KL divergence between multivariate Gaussian distributions and give the supremum of $KL(\mathcal{N}_1, \mathcal{N}_3)$ as well as the conditions when the supremum can be attained. When $\epsilon_1$ and $\epsilon_2$ are small, the supremum is $\epsilon_1+\epsilon_2+2\sqrt{\epsilon_1\epsilon_2}+o(\epsilon_1)+o(\epsilon_2)$. Finally, we demonstrate several applications of our results in out-of-distribution detection with flow-based generative models and safe reinforcement learning.
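To make the failure of the triangle inequality concrete, the following is a minimal sketch using the standard closed-form KL divergence between two multivariate Gaussians. The function name `kl_gaussian` and the three example distributions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL(N(mu0, cov0) || N(mu1, cov1)) in nats, via the standard closed form."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.shape[0]
    diff = mu1 - mu0
    # Solve linear systems instead of forming an explicit inverse of cov1.
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - k + logdet1 - logdet0)

# Hypothetical example: three unit-variance 1-D Gaussians with means 0, 1, 2.
eye = [[1.0]]
kl_12 = kl_gaussian([0.0], eye, [1.0], eye)
kl_23 = kl_gaussian([1.0], eye, [2.0], eye)
kl_13 = kl_gaussian([0.0], eye, [2.0], eye)
print(kl_12, kl_23, kl_13)
```

In this example `kl_13` exceeds `kl_12 + kl_23`, so the ordinary triangle inequality does not hold even for equal-covariance Gaussians.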
- [2] arXiv:2602.02633 [pdf, ps, other]
Title: Rethinking Test-Time Training: Tilting The Latent Distribution For Few-Shot Source-Free Adaptation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Often, constraints arise in deployment settings where even lightweight parameter updates (e.g., parameter-efficient fine-tuning) could induce model shift or tuning instability. We study test-time adaptation of foundation models for few-shot classification under a completely frozen-model regime, where, additionally, no upstream data are accessible. We propose arguably the first training-free inference method that adapts predictions to the new task by performing a change of measure over the latent embedding distribution induced by the encoder. Using task-similarity scores derived from a small labeled support set, exponential tilting reweights latent distributions in a KL-optimal manner without modifying model parameters. Empirically, the method consistently competes with parameter-update-based methods across multiple benchmarks and shot regimes, while operating under strictly and universally stronger constraints. These results demonstrate the viability of inference-level distributional correction for test-time adaptation even with a fully-frozen model pipeline.
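As a rough illustration of exponential tilting in general (not the paper's procedure), the sketch below reweights a discrete base distribution by `exp(lam * score)` and renormalizes. The function name `exponential_tilt`, the scores, and the tilt strength `lam` are hypothetical, and the base probabilities are assumed to be strictly positive.

```python
import numpy as np

def exponential_tilt(base_probs, scores, lam):
    """Return q_i proportional to p_i * exp(lam * s_i) for a discrete base p."""
    base_probs = np.asarray(base_probs, float)
    scores = np.asarray(scores, float)
    # Work in log space and subtract the maximum for numerical stability.
    log_w = np.log(base_probs) + lam * scores
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical example: uniform base over four items with increasing scores.
base = [0.25, 0.25, 0.25, 0.25]
scores = [0.0, 1.0, 2.0, 3.0]
untilted = exponential_tilt(base, scores, lam=0.0)  # recovers the base
tilted = exponential_tilt(base, scores, lam=1.0)    # mass shifts to high scores
print(untilted, tilted)
```

Among all distributions with the same expected score, a tilt of this form is the one closest to the base in KL divergence, which is the usual sense in which such reweighting is called KL-optimal.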
- [3] arXiv:2602.02703 [pdf, ps, other]
Title: Selective Information Borrowing for Region-Specific Treatment Effect Inference under Covariate Mismatch in Multi-Regional Clinical Trials
Subjects: Methodology (stat.ME)
Multi-regional clinical trials (MRCTs) are central to global drug development, enabling evaluation of treatment effects across diverse populations. A key challenge is valid and efficient inference for a region-specific estimand when the target region is small and differs from auxiliary regions in baseline covariates or unmeasured factors. We adopt an estimand-based framework and focus on the region-specific average treatment effect (RSATE) in a prespecified target region, which is directly relevant to local regulatory decision-making. Cross-region differences can induce covariate shift, covariate mismatch, and outcome drift, potentially biasing information borrowing and invalidating RSATE inference. To address these issues, we develop a unified causal inference framework with selective information borrowing. First, we introduce an inverse-variance weighting estimator that combines a "small-sample, rich-covariate" target-only estimator with a "large-sample, limited-covariate" full-borrowing doubly robust estimator, maximizing efficiency under no outcome drift. Second, to accommodate outcome drift, we apply conformal prediction to assess patient-level comparability and adaptively select auxiliary-region patients for borrowing. Third, to ensure rigorous finite-sample inference, we employ a conditional randomization test with exact, model-free, selection-aware type I error control. Simulation studies show the proposed estimator improves efficiency, yielding 10-50% reductions in mean squared error and higher power relative to no-borrowing and full-borrowing approaches, while maintaining valid inference across diverse scenarios. An application to the POWER trial further demonstrates improved precision for RSATE estimation.
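The basic idea of inverse-variance weighting can be sketched as follows. This is a simplified illustration that assumes two independent, unbiased estimates with known variances; the paper's estimator combines a target-only and a full-borrowing estimator whose dependence structure may differ, and the function name and numbers here are hypothetical.

```python
def inverse_variance_combine(est_a, var_a, est_b, var_b):
    """Combine two independent unbiased estimates with weights 1/variance."""
    precision_a = 1.0 / var_a
    precision_b = 1.0 / var_b
    total_precision = precision_a + precision_b
    weight_a = precision_a / total_precision
    combined = weight_a * est_a + (1.0 - weight_a) * est_b
    # Under independence, the combined variance is the inverse total precision.
    combined_var = 1.0 / total_precision
    return combined, combined_var

# Hypothetical numbers: a noisy small-sample estimate and a more precise one.
combined, combined_var = inverse_variance_combine(1.0, 4.0, 2.0, 1.0)
print(combined, combined_var)
```

The combined variance is never larger than the smaller of the two input variances, which is why the weighting improves efficiency when both inputs are unbiased.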
- [4] arXiv:2602.02753 [pdf, ps, other]
Title: Effect-Wise Inference for Smoothing Spline ANOVA on Tensor-Product Sobolev Space
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Functional ANOVA provides a nonparametric modeling framework for multivariate covariates, enabling flexible estimation and interpretation of effect functions such as main effects and interaction effects. However, effect-wise inference in such models remains challenging. Existing methods focus primarily on inference for entire functions rather than individual effects. Methods addressing effect-wise inference face substantial limitations: the inability to accommodate interactions, a lack of rigorous theoretical foundations, or restriction to pointwise inference. To address these limitations, we develop a unified framework for effect-wise inference in smoothing spline ANOVA on a subspace of tensor product Sobolev space. For each effect function, we establish rates of convergence, pointwise confidence intervals, and a Wald-type test for whether the effect is zero, with power achieving the minimax distinguishable rate up to a logarithmic factor. Main effects achieve the optimal univariate rates, and interactions achieve optimal rates up to logarithmic factors. The theoretical foundation relies on an orthogonality decomposition of effect subspaces, which enables the extension of the functional Bahadur representation framework to effect-wise inference in smoothing spline ANOVA with interactions. Simulation studies and a real-data application to the Colorado temperature dataset demonstrate superior performance compared to existing methods.
- [5] arXiv:2602.02759 [pdf, ps, other]
Title: Near-Universal Multiplicative Updates for Nonnegative Einsum Factorization
Comments: 26 pages, 5 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Despite the ubiquity of multiway data across scientific domains, there are few user-friendly tools that fit tailored nonnegative tensor factorizations. Researchers may use gradient-based automatic differentiation (which often struggles in nonnegative settings), choose between a limited set of methods with mature implementations, or implement their own model from scratch. As an alternative, we introduce NNEinFact, an einsum-based multiplicative update algorithm that fits any nonnegative tensor factorization expressible as a tensor contraction by minimizing one of many user-specified loss functions (including the $(\alpha,\beta)$-divergence). To use NNEinFact, the researcher simply specifies their model with a string. NNEinFact converges to a local minimum of the loss, supports missing data, and fits to tensors with hundreds of millions of entries in seconds. Empirically, NNEinFact fits custom models which outperform standard ones in heldout prediction tasks on real-world tensor data by over $37\%$ and attains less than half the test loss of gradient-based methods while converging up to 90 times faster.
- [6] arXiv:2602.02771 [pdf, ps, other]
Title: Markov Random Fields: Structural Properties, Phase Transition, and Response Function Analysis
Subjects: Methodology (stat.ME)
This paper presents a focused review of Markov random fields (MRFs)--commonly used probabilistic representations of spatial dependence in discrete spatial domains--for categorical data, with an emphasis on models for binary-valued observations or latent variables. We examine core structural properties of these models, including clique factorization, conditional independence, and the role of neighborhood structures. We also discuss the phenomenon of phase transition and its implications for statistical model specification and inference. A central contribution of this review is the use of response functions, a unifying tool we introduce for prior analysis that provides insight into how different formulations of MRFs influence implied marginal and joint distributions. We illustrate these concepts through a case study of direct-data MRF models with covariates, highlighting how different formulations encode dependence. While our focus is on binary fields, the principles outlined here extend naturally to more complex categorical MRFs and we draw connections to these higher-dimensional modeling scenarios. This review provides both theoretical grounding and practical tools for interpreting and extending MRF-based models.
- [7] arXiv:2602.02777 [pdf, ps, other]
Title: Disentangling spatial interference and spatial confounding biases in causal inference
Subjects: Methodology (stat.ME)
Spatial interference and spatial confounding are two major issues inhibiting precise causal estimates when dealing with observational spatial data. Moreover, the definition and interpretation of spatial confounding remain debated in the literature. In this paper, our goal is to clarify misconceptions and issues around spatial confounding from a Directed Acyclic Graph (DAG) perspective and to disentangle direct spatial confounding, indirect spatial confounding, and spatial interference based on the bias each induces on causal estimates. Also, existing analyses of spatial confounding bias typically rely on Normality assumptions for treatments and confounders, assumptions that are often violated in practice. Relaxing these assumptions, we derive analytical expressions for spatial confounding bias under more general distributional settings, using the Poisson distribution as an example. We show that the choice of spatial weights, the distribution of the treatment, and the magnitude of interference critically determine the extent of bias due to spatial interference. We further demonstrate that direct and indirect spatial confounding can be disentangled, with both the weight matrix and the nature of exposure playing central roles in determining the magnitude of indirect bias. Theoretical results are supported by simulation studies and an application to real-world spatial data. In the future, parametric frameworks for concomitantly adjusting for spatial interference and for direct and indirect spatial confounding, for both direct and mediated effects estimation, will be developed.
- [8] arXiv:2602.02791 [pdf, ps, other]
Title: Plug-In Classification of Drift Functions in Diffusion Processes Using Neural Networks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We study a supervised multiclass classification problem for diffusion processes, where each class is characterized by a distinct drift function and trajectories are observed at discrete times. Extending the one-dimensional multiclass framework of Denis et al. (2024) to multidimensional diffusions, we propose a neural network-based plug-in classifier that estimates the drift functions for each class from independent sample paths and assigns labels based on a Bayes-type decision rule. Under standard regularity assumptions, we establish convergence rates for the excess misclassification risk, explicitly capturing the effects of drift estimation error and time discretization. Numerical experiments demonstrate that the proposed method achieves faster convergence and improved classification performance compared to Denis et al. (2024) in the one-dimensional setting, remains effective in higher dimensions when the underlying drift functions admit a compositional structure, and consistently outperforms direct neural network classifiers trained end-to-end on trajectories without exploiting the diffusion model structure.
- [9] arXiv:2602.02800 [pdf, ps, other]
Title: Decision-Focused Optimal Transport
Subjects: Statistics Theory (math.ST)
We propose a fundamental metric for measuring the distance between two distributions. This metric, referred to as the decision-focused (DF) divergence, is tailored to stochastic linear optimization problems in which the objective coefficients are random and may follow two distinct distributions. Traditional metrics such as KL divergence and Wasserstein distance are not well-suited for quantifying the resulting cost discrepancy, because changes in the coefficient distribution do not necessarily change the optimizer of the underlying linear program. Instead, the impact on the objective value depends on how the two distributions are coupled (aligned). Motivated by optimal transport, we introduce decision-focused distances under several settings, including the optimistic DF distance, the robust DF distance, and their entropy-regularized variants. We establish connections between the proposed DF distance and classical distributional metrics. For the calculation of the DF distance, we develop efficient computational methods. We further derive sample complexity guarantees for estimating these distances and show that the DF distance estimation avoids the curse of dimensionality that arises in Wasserstein distance estimation. The proposed DF distance provides a foundation for a broad range of applications. As an illustrative example, we study the interpolation between two distributions. Numerical studies, including a toy newsvendor problem and a real-world medical testing dataset, demonstrate the practical value of the proposed DF distance.
- [10] arXiv:2602.02806 [pdf, ps, other]
Title: De-Linearizing Agent Traces: Bayesian Inference of Latent Partial Orders for Efficient Execution
Subjects: Applications (stat.AP)
AI agents increasingly execute procedural workflows as sequential action traces, which obscures latent concurrency and induces repeated step-by-step reasoning. We introduce BPOP, a Bayesian framework that infers a latent dependency partial order from noisy linearized traces. BPOP models traces as stochastic linear extensions of an underlying graph and performs efficient MCMC inference via a tractable frontier-softmax likelihood that avoids #P-hard marginalization over linear extensions. We evaluate on our open-sourced Cloud-IaC-6, a suite of cloud provisioning tasks with heterogeneous LLM-generated traces, and WFCommons scientific workflows. BPOP recovers dependency structure more accurately than trace-only and process-mining baselines, and the inferred graphs support a compiled executor that prunes irrelevant context, yielding substantial reductions in token usage and execution time.
- [11] arXiv:2602.02809 [pdf, ps, other]
Title: A Model-Robust G-Computation Method for Analyzing Hybrid Control Studies Without Assuming Exchangeability
Subjects: Methodology (stat.ME)
There is growing interest in a hybrid control design for treatment evaluation, where a randomized controlled trial is augmented with external control data from a previous trial or a real world data source. The hybrid control design has the potential to improve efficiency but also carries the risk of introducing bias. The potential bias in a hybrid control study can be mitigated by adjusting for baseline covariates that are related to the control outcome. Existing methods that serve this purpose commonly assume that the internal and external control outcomes are exchangeable upon conditioning on a set of measured covariates. Possible violations of the exchangeability assumption can be addressed using a g-computation method with variable selection under a correctly specified outcome regression model. In this article, we note that a particular version of this g-computation method is protected against misspecification of the outcome regression model. This observation leads to a model-robust g-computation method that is remarkably simple and easy to implement, consistent and asymptotically normal under minimal assumptions, and able to improve efficiency by exploiting similarities between the internal and external control groups. The method is evaluated in a simulation study and illustrated using real data from HIV treatment trials.
- [12] arXiv:2602.02813 [pdf, ps, other]
Title: Downscaling land surface temperature data using edge detection and block-diagonal Gaussian process regression
Authors: Sanjit Dandapanthula, Margaret Johnson, Madeleine Pascolini-Campbell, Glynn Hulley, Mikael Kuusela
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Accurate and high-resolution estimation of land surface temperature (LST) is crucial in estimating evapotranspiration, a measure of plant water use and a central quantity in agricultural applications. In this work, we develop a novel statistical method for downscaling LST data obtained from NASA's ECOSTRESS mission, using high-resolution data from the Landsat 8 mission as a proxy for modeling agricultural field structure. Using the Landsat data, we identify the boundaries of agricultural fields through edge detection techniques, allowing us to capture the inherent block structure present in the spatial domain. We propose a block-diagonal Gaussian process (BDGP) model that captures the spatial structure of the agricultural fields, leverages independence of LST across fields for computational tractability, and accounts for the change of support present in ECOSTRESS observations. We use the resulting BDGP model to perform Gaussian process regression and obtain high-resolution estimates of LST from ECOSTRESS data, along with uncertainty quantification. Our results demonstrate the practicality of the proposed method in producing reliable high-resolution LST estimates, with potential applications in agriculture, urban planning, and climate studies.
- [13] arXiv:2602.02825 [pdf, ps, other]
Title: On the consistent and scalable detection of spatial patterns
Subjects: Applications (stat.AP); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
Detecting spatial patterns is fundamental to scientific discovery, yet current methods lack statistical consensus and face computational barriers when applied to large-scale spatial omics datasets. We unify major approaches through a single quadratic form and derive general consistency conditions. We reveal that several widely used methods, including Moran's I, are inconsistent, and propose scalable corrections. The resulting test enables robust pattern detection across millions of spatial locations and single-cell lineage-tracing datasets.
- [14] arXiv:2602.02860 [pdf, ps, other]
Title: Functional regression with multivariate responses
Subjects: Methodology (stat.ME)
We consider the functional regression model with multivariate response and functional predictors. Compared to fitting each individual response variable separately, taking advantage of the correlation between the response variables can improve the estimation and prediction accuracy. Using information in both functional predictors and multivariate response, we identify the optimal decomposition of the coefficient functions for prediction at the population level. Then we propose methods to estimate this decomposition and fit the regression model, treating separately the cases of a small and a large number $p$ of functional predictors. For a large $p$, we propose a simultaneous smooth-sparse penalty which can both perform curve selection and improve estimation and prediction accuracy. We provide asymptotic results when both the sample size and the number of functional predictors go to infinity. Our method can be applied to models with thousands of functional predictors and has been implemented in the R package FRegSigCom.
- [15] arXiv:2602.02874 [pdf, ps, other]
Title: Ten simple rules for teaching data science
Subjects: Other Statistics (stat.OT)
Teaching data science presents unique challenges and opportunities that cannot be fully addressed by simply borrowing pedagogical strategies from its parent disciplines of statistics and computer science. Here, we present ten simple rules for teaching data science, developed and refined by leading educators in the community and successfully applied in our own data science classrooms.
- [16] arXiv:2602.02875 [pdf, ps, other]
Title: Shiha Distribution: Statistical Properties and Applications to Reliability Engineering and Environmental Data
Authors: F. A. Shiha
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
This paper introduces a new two-parameter distribution, referred to as the Shiha distribution, which provides a flexible model for skewed lifetime data with either heavy or light tails. The proposed distribution is applicable to various fields, including reliability engineering, environmental studies, and related areas. We derive its main statistical properties, including the moment generating function, moments, hazard rate function, quantile function, and entropy. The stress--strength reliability parameter is also derived in closed form. A simulation study is conducted to evaluate its performance. Applications to several real data sets demonstrate that the Shiha distribution consistently provides a superior fit compared with established competing models, confirming its practical effectiveness for lifetime data analysis.
- [17] arXiv:2602.02887 [pdf, ps, other]
Title: From Accessibility to Allocation: An Integrated Workflow for Land-Use Assignment and FAR Estimation
Subjects: Computation (stat.CO)
Urban land use and building intensity are often planned without a direct, auditable link to network accessibility, limiting ex-ante policy evaluation. This study asks whether multi-radius street centralities can be elevated from diagnosis to design lever to allocate land use and floor area in a transparent, optimization-ready workflow. We introduce a three-stage pipeline that connects configuration to program and intensity. First, multi-radius accessibility is computed on the street network and translated to blocks to provide scale-legible measures of reach. Second, these measures structure nested service basins that guide a rule-based placement of land uses with explicit priorities and minimum parcel footprints, ensuring reproducibility. Third, within each use, floor-area ratio (FAR) is assigned by an accessibility-weighted linear model that satisfies global construction totals while anchoring the average FAR, thereby tilting height toward better-connected blocks without pathological extremes. The framework supports multi-objective policy search via sampling and Pareto screening. Applied to a real urban district, the workflow reproduces corridor-biased commercial siting and industrial belts while concentrating intensity on highly connected blocks. Policy sampling via multi-objective screening yields Pareto-efficient plans that reconcile accessibility gains with deviations from target land-share and construction-share structures. The contribution is twofold: methodologically, it translates familiar space-syntax measures into cluster-aware, rule-governed land-use and FAR assignment with explicit guarantees (scale-legible radii, parcel minima, and an average-FAR anchor). Practically, it offers planners a transparent instrument for counterfactual testing and negotiated trade-offs at neighborhood/district/city scales.
- [18] arXiv:2602.02927 [pdf, ps, other]
Title: Training-Free Self-Correction for Multimodal Masked Diffusion Models
Authors: Yidong Ouyang, Panwen Hu, Zhengyan Wan, Zhe Wang, Liyan Xie, Dmitriy Bespalov, Ying Nian Wu, Guang Cheng, Hongyuan Zha, Qiang Sun
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code is available at https://github.com/huge123/FreeCorrection.
- [19] arXiv:2602.02931 [pdf, ps, other]
Title: Weighted Sum-of-Trees Model for Clustered Data
Comments: 14 pages, 8 figures, 3 tables
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Clustered data, which arise when observations are nested within groups, are incredibly common in clinical, education, and social science research. Traditionally, a linear mixed model, which includes random effects to account for within-group correlation, would be used to model the observed data and make new predictions on unseen data. Some work has been done to extend the mixed model approach beyond linear regression into more complex and non-parametric models, such as decision trees and random forests. However, existing methods are limited to using the global fixed effects for prediction on data from out-of-sample groups, effectively assuming that all clusters share a common outcome model. We propose a lightweight sum-of-trees model in which we learn a decision tree for each sample group. We combine the predictions from these trees using weights so that out-of-sample group predictions are more closely aligned with the most similar groups in the training data. This strategy also allows for inference on the similarity across groups in the outcome prediction model, as the unique tree structures and variable importances for each group can be directly compared. We show our model outperforms traditional decision trees and random forests in a variety of simulation settings. Finally, we showcase our method on real-world data from the sarcoma cohort of The Cancer Genome Atlas, where patient samples are grouped by sarcoma subtype.
- [20] arXiv:2602.02945 [pdf, ps, other]
Title: Bayesian Methods for the Navier-Stokes Equations
Subjects: Computation (stat.CO); Numerical Analysis (math.NA)
We develop a Bayesian methodology for numerical solution of the incompressible Navier--Stokes equations with quantified uncertainty. The central idea is to treat discretized Navier--Stokes dynamics as a state-space model and to view numerical solution as posterior computation: priors encode physical structure and modeling error, and the solver outputs a distribution over states and quantities of interest rather than a single trajectory. In two dimensions, stochastic representations (Feynman--Kac and stochastic characteristics for linear advection--diffusion with prescribed drift) motivate Monte Carlo solvers and provide intuition for uncertainty propagation. In three dimensions, we formulate stochastic Navier--Stokes models and describe particle-based and ensemble-based Bayesian workflows for uncertainty propagation in spectral discretizations. A key computational advantage is that parameter learning can be performed stably via particle learning: marginalization and resample--propagate (one-step smoothing) constructions avoid the weight-collapse that plagues naive sequential importance sampling on static parameters. When partial observations are available, the same machinery supports sequential observational updating as an additional capability. We also discuss non-Gaussian (heavy-tailed) error models based on normal variance-mean mixtures, which yield conditionally Gaussian updates via latent scale augmentation.
- [21] arXiv:2602.03049 [pdf, ps, other]
Title: Unified Inference Framework for Single and Multi-Player Performative Prediction: Method and Asymptotic Optimality
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Performative prediction characterizes environments where predictive models alter the very data distributions they aim to forecast, triggering complex feedback loops. While prior research treats single-agent and multi-agent performativity as distinct phenomena, this paper introduces a unified statistical inference framework that bridges these contexts, treating the former as a special case of the latter. Our contribution is two-fold. First, we put forward the Repeated Risk Minimization (RRM) procedure for estimating the performative stability, and establish a rigorous inferential theory for admitting its asymptotic normality and confirming its asymptotic efficiency. Second, for the performative optimality, we introduce a novel two-step plug-in estimator that integrates the idea of Recalibrated Prediction Powered Inference (RePPI) with Importance Sampling, and further provide formal derivations for the Central Limit Theorems of both the underlying distributional parameters and the plug-in results. The theoretical analysis demonstrates that our estimator achieves the semiparametric efficiency bound and maintains robustness under mild distributional misspecification. This work provides a principled toolkit for reliable estimation and decision-making in dynamic, performative environments.
- [22] arXiv:2602.03077 [pdf, ps, other]
Title: Empirical Bayes Shrinkage of Functional Effects, with Application to Analysis of Dynamic eQTLs
Subjects: Methodology (stat.ME); Applications (stat.AP)
We introduce functional adaptive shrinkage (FASH), an empirical Bayes method for joint analysis of observation units in which each unit estimates an effect function at several values of a continuous condition variable. The ideas in this paper are motivated by dynamic expression quantitative trait locus (eQTL) studies, which aim to characterize how genetic effects on gene expression vary with time or another continuous condition. FASH integrates a broad family of Gaussian processes defined through linear differential operators into an empirical Bayes shrinkage framework, enabling adaptive smoothing and borrowing of information across units. This provides improved estimation of effect functions and principled hypothesis testing, allowing straightforward computation of significance measures such as local false discovery and false sign rates. To encourage conservative inferences, we propose a simple prior-adjustment method that has theoretical guarantees and can be more broadly used with other empirical Bayes methods. We illustrate the benefits of FASH by reanalyzing dynamic eQTL data on cardiomyocyte differentiation from induced pluripotent stem cells. FASH identified novel dynamic eQTLs, revealed diverse temporal effect patterns, and provided improved power compared with the original analysis. More broadly, FASH offers a flexible statistical framework for joint analysis of functional data, with applications extending beyond genomics. To facilitate use of FASH in dynamic eQTL studies and other settings, we provide an accompanying R package at https://github.com/stephenslab/fashr.
- [23] arXiv:2602.03165 [pdf, other]
Title: Entropic Mirror Monte Carlo
Authors: Anas Cherradi (LPSM (UMR_8001), SU), Yazid Janati, Alain Durmus (CMAP), Sylvain Le Corff (LPSM (UMR_8001), SU), Yohan Petetin, Julien Stoehr (CEREMADE)
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Importance sampling is a Monte Carlo method which designs estimators of expectations under a target distribution using weighted samples from a proposal distribution. When the target distribution is complex, such as multimodal distributions in high-dimensional spaces, the efficiency of importance sampling critically depends on the choice of the proposal distribution. In this paper, we propose a novel adaptive scheme for the construction of efficient proposal distributions. Our algorithm promotes efficient exploration of the target distribution by combining global sampling mechanisms with a delayed weighting procedure. The proposed weighting mechanism plays a key role by enabling rapid resampling in regions where the proposal distribution is poorly adapted to the target. Our sampling algorithm is shown to be geometrically convergent under mild assumptions and is illustrated through various numerical experiments.
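As background for the first sentence of this abstract, here is a minimal sketch of plain self-normalized importance sampling with a fixed proposal. It is not the adaptive scheme the paper proposes. The target N(3, 1), the proposal N(0, 3^2), and the names `normal_pdf` and `snis_mean` are assumptions chosen for illustration.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def snis_mean(n_samples, seed=0):
    """Self-normalized importance sampling estimate of E[X] under the target.

    Hypothetical setup: target N(3, 1), fixed proposal N(0, 3^2).
    """
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 3.0)  # draw from the proposal
        # Importance weight: target density divided by proposal density.
        w = normal_pdf(x, 3.0, 1.0) / normal_pdf(x, 0.0, 3.0)
        weighted_sum += w * x
        weight_total += w
    # Dividing by the total weight makes the estimator self-normalized.
    return weighted_sum / weight_total


estimate = snis_mean(20000)
```

The estimate should land close to the target mean of 3. A proposal that covers the target poorly would inflate the weight variance, which is the difficulty the abstract's adaptive construction is aimed at.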
- [24] arXiv:2602.03168 [pdf, ps, other]
-
Title: Online Conformal Prediction via Universal Portfolio Algorithms
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Online conformal prediction (OCP) seeks prediction intervals that achieve long-run $1-\alpha$ coverage for arbitrary (possibly adversarial) data streams, while remaining as informative as possible. Existing OCP methods often require manual learning-rate tuning to work well, and may also require algorithm-specific analyses. Here, we develop a general regret-to-coverage theory for interval-valued OCP based on the $(1-\alpha)$-pinball loss. Our first contribution is to identify \emph{linearized regret} as a key notion, showing that controlling it implies coverage bounds for any online algorithm. This relies on a black-box reduction that depends only on the Fenchel conjugate of an upper bound on the linearized regret. Building on this theory, we propose UP-OCP, a parameter-free method for OCP, via a reduction to a two-asset portfolio selection problem, leveraging universal portfolio algorithms. We show strong finite-time bounds on the miscoverage of UP-OCP, even for polynomially growing predictions. Extensive experiments support that UP-OCP delivers consistently better size/coverage trade-offs than prior online conformal baselines.
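To make the $(1-\alpha)$-pinball loss concrete, the sketch below evaluates it for a scalar threshold and picks the threshold with the smallest total loss on a batch of scores. This is a simplified, offline illustration and not the interval-valued online method of the paper. The scores and the names `pinball_loss` and `best_threshold` are hypothetical.

```python
def pinball_loss(q, y, tau):
    """tau-pinball (quantile) loss of threshold q at outcome y."""
    if y >= q:
        return tau * (y - q)
    return (1.0 - tau) * (q - y)


def best_threshold(candidates, outcomes, tau):
    """Candidate with the smallest total pinball loss over the outcomes."""
    return min(
        candidates,
        key=lambda q: sum(pinball_loss(q, y, tau) for y in outcomes),
    )


alpha = 0.1
scores = list(range(11))  # hypothetical scores 0, 1, ..., 10
q_star = best_threshold(scores, scores, 1.0 - alpha)
```

With these scores the minimizer is 9, which covers 10 of the 11 scores. This is the usual link between minimizing the pinball loss and targeting the $1-\alpha$ quantile.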
- [25] arXiv:2602.03169 [pdf, ps, other]
-
Title: NeuralFLoC: Neural Flow-Based Joint Registration and Clustering of Functional Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Clustering functional data in the presence of phase variation is challenging, as temporal misalignment can obscure intrinsic shape differences and degrade clustering performance. Most existing approaches treat registration and clustering as separate tasks or rely on restrictive parametric assumptions. We present \textbf{NeuralFLoC}, a fully unsupervised, end-to-end deep learning framework for joint functional registration and clustering based on Neural ODE-driven diffeomorphic flows and spectral clustering. The proposed model learns smooth, invertible warping functions and cluster-specific templates simultaneously, effectively disentangling phase and amplitude variation. We establish universal approximation guarantees and asymptotic consistency for the proposed framework. Experiments on functional benchmarks show state-of-the-art performance in both registration and clustering, with robustness to missing data, irregular sampling, and noise, while maintaining scalability. Code is available at https://anonymous.4open.science/r/NeuralFLoC-FEC8.
- [26] arXiv:2602.03202 [pdf, ps, other]
-
Title: Sharp Inequalities between Total Variation and Hellinger Distances for Gaussian Mixtures
Comments: 34 pages
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
We study the relation between the total variation (TV) and Hellinger distances between two Gaussian location mixtures. Our first result establishes a general upper bound: for any two mixing distributions supported on a compact set, the Hellinger distance between the two mixtures is controlled by the TV distance raised to a power $1-o(1)$, where the $o(1)$ term is of order $1/\log\log(1/\mathrm{TV})$. We also construct two sequences of mixing distributions that demonstrate the sharpness of this bound. Taken together, our results resolve an open problem raised in Jia et al. (2023) and thus lead to an entropic characterization of learning Gaussian mixtures in total variation. Our inequality also yields optimal robust estimation of Gaussian mixtures in Hellinger distance, which has a direct implication for bounding the minimax regret of empirical Bayes under Huber contamination.
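For context, the classical comparison between the two distances, valid for arbitrary distributions $P$ and $Q$, is recalled below. The normalization is an assumption (conventions differ by constant factors): $H^2(P,Q)=\int(\sqrt{p}-\sqrt{q})^2$ and $\mathrm{TV}(P,Q)=\tfrac12\int|p-q|$.

```latex
\frac{1}{2}\,H^2(P,Q) \;\le\; \mathrm{TV}(P,Q) \;\le\; H(P,Q)\,\sqrt{1-\frac{H^2(P,Q)}{4}},
\qquad\text{hence}\qquad
H(P,Q) \;\le\; \sqrt{2\,\mathrm{TV}(P,Q)} .
```

In general, then, the Hellinger distance is only controlled by the TV distance raised to the power $1/2$. The abstract's claim is that for Gaussian location mixtures with compactly supported mixing distributions the exponent improves to $1-o(1)$.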
- [27] arXiv:2602.03215 [pdf, ps, other]
-
Title: Latent Neural-ODE for Model-Informed Precision Dosing: Overcoming Structural Assumptions in Pharmacokinetics
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Accurate estimation of tacrolimus exposure, quantified by the area under the concentration-time curve (AUC), is essential for precision dosing after renal transplantation. Current practice relies on population pharmacokinetic (PopPK) models based on nonlinear mixed-effects (NLME) methods. However, these models depend on rigid, pre-specified assumptions and may struggle to capture complex, patient-specific dynamics, leading to model misspecification.
In this study, we introduce a novel data-driven alternative based on Latent Ordinary Differential Equations (Latent ODEs) for tacrolimus AUC prediction. This deep learning approach learns individualized pharmacokinetic dynamics directly from sparse clinical data, enabling greater flexibility in modeling complex biological behavior. The model was evaluated through extensive simulations across multiple scenarios and benchmarked against two standard approaches: NLME-based estimation and the iterative two-stage Bayesian (it2B) method. We further performed a rigorous clinical validation using a development dataset (n = 178) and a completely independent external dataset (n = 75).
In simulation, the Latent ODE model demonstrated superior robustness, maintaining high accuracy even when underlying biological mechanisms deviated from standard assumptions. Regarding experiments on clinical datasets, in internal validation, it achieved significantly higher precision with a mean RMSPE of 7.99% compared with 9.24% for it2B (p < 0.001). On the external cohort, it achieved an RMSPE of 10.82%, comparable to the two standard estimators (11.48% and 11.54%).
These results establish the Latent ODE as a powerful and reliable tool for AUC prediction. Its flexible architecture provides a promising foundation for next-generation, multi-modal models in personalized medicine.
- [28] arXiv:2602.03218 [pdf, ps, other]
-
Title: Blinded sample size re-estimation accounting for uncertainty in mid-trial estimation
Subjects: Methodology (stat.ME); Applications (stat.AP)
For randomized controlled trials to be conclusive, it is important to set the target sample size accurately at the design stage. When comparing two normal populations, the sample size calculation requires specification of the variance in addition to the treatment effect, and misspecification can lead to underpowered studies. Blinded sample size re-estimation is an approach to minimize the risk of inconclusive studies. Existing methods propose using the total (one-sample) variance, which is estimable from blinded data without knowledge of the treatment allocation. We demonstrate that, since the expectation of this estimator is greater than or equal to the true variance, the one-sample variance approach can be regarded as providing an upper bound of the variance in blind reviews. This worst-case evaluation can likely reduce the risk of underpowered studies. However, blinded reviews with small sample sizes may still lead to underpowered studies. We propose a refined method accounting for estimation error in blind reviews using an upper confidence limit of the variance. A similar idea had been proposed in the setting of external pilot studies. Furthermore, we developed a method to select an appropriate confidence level so that the re-estimated sample size attains the target power. Numerical studies showed that our method works well and outperforms existing methods. The proposed procedure is motivated and illustrated by recent randomized clinical trials.
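The two quantities this abstract relies on can be written out under textbook assumptions that are not taken from the paper: a two-sided test at level $\alpha$ with target power $1-\beta$, a common variance $\sigma^2$, a true mean difference $\delta$, and equal allocation of $N$ blinded observations ($N/2$ per arm). The symbol $S^2_{\mathrm{OS}}$ for the one-sample variance is notation introduced here.

```latex
n_{\text{per arm}} \;=\; \frac{2\,\sigma^2\,\bigl(z_{1-\alpha/2}+z_{1-\beta}\bigr)^2}{\delta^2},
\qquad
\mathbb{E}\bigl[S^2_{\mathrm{OS}}\bigr] \;=\; \sigma^2 + \frac{N}{4\,(N-1)}\,\delta^2 \;\ge\; \sigma^2 .
```

The second identity follows from splitting the total sum of squares into a within-arm part, with expectation $(N-2)\sigma^2$, and a between-arm part, with expectation $\sigma^2 + N\delta^2/4$. It shows why plugging the blinded variance into the first formula tends to err toward a larger sample size.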
- [29] arXiv:2602.03258 [pdf, ps, other]
-
Title: Principled Federated Random Forests for Heterogeneous Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
- [30] arXiv:2602.03274 [pdf, ps, other]
-
Title: Six-Minute Man Sander Eitrem 5:58.52 -- first man below the 6:00.00 barrier
Authors: Nils Lid Hjort
Subjects: Other Statistics (stat.OT); Physics and Society (physics.soc-ph)
In Calgary, November 2005, Chad Hedrick was the first to skate the 5,000 m below 6:10. His world record time 6:09.68 was then beaten a week later, in Salt Lake City, by Sven Kramer's 6:08.78. Further top races and world records followed over the ensuing seasons; up to and including the 2024-2025 season, a total of 126 races have been below 6:10, with Nils van der Poel's 2021 world record being 6:01.56. The appropriately hyped-up canonical question for the friends and followers and aficionados of speedskating has then been when (and by whom) we would for the first time witness a below-6:00.00 race. In this note I first use extreme value statistics modelling to assess the state of affairs, as per the end of the 2024-2025 season, with predictions and probabilities for the 2025-2026 season. Under natural modelling assumptions the probability of seeing a new world record during this new season is shown to be about ten percent. We were indeed excited but in reality merely modestly surprised that a race better than van der Poel's record was clocked, by Timothy Loubineaud, in Salt Lake City, November 14, 2025. But Six-Minute Man Sander Eitrem's outstanding 5:58.52 in Inzell, on January 24, 2026, is truly beamonesquely shocking. I also use the modelling machinery to analyse the post-Eitrem situation, and suggest answers to the question of how fast the 5,000 m ever can be skated.
- [31] arXiv:2602.03283 [pdf, ps, other]
-
Title: Orthogonal Approximate Message Passing Algorithms for Rectangular Spiked Matrix Models with Rotationally Invariant Noise
Comments: To appear in the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (stat.ML)
We propose an orthogonal approximate message passing (OAMP) algorithm for signal estimation in the rectangular spiked matrix model with general rotationally invariant (RI) noise. We establish a rigorous state evolution that exactly characterizes the high-dimensional dynamics of the algorithm. Building on this framework, we derive an optimal variant of OAMP that minimizes the predicted mean-squared error at each iteration. For the special case of i.i.d. Gaussian noise, the fixed point of the proposed OAMP algorithm coincides with that of the standard AMP algorithm. For general RI noise models, we conjecture that the optimal OAMP algorithm is statistically optimal within a broad class of iterative methods, and achieves Bayes-optimal performance in certain regimes.
- [32] arXiv:2602.03317 [pdf, ps, other]
-
Title: Multiparameter Uncertainty Mapping in Quantitative Molecular MRI using a Physics-Structured Variational Autoencoder (PS-VAE)
Authors: Alex Finkelstein, Ron Moneta, Or Zohar, Michal Rivlin, Moritz Zaiss, Dinora Friedmann Morvinski, Or Perlman
Comments: Submitted to IEEE Transactions on Medical Imaging. This project was funded by the European Union (ERC, BabyMagnet, project no. 101115639). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Quantitative imaging methods, such as magnetic resonance fingerprinting (MRF), aim to extract interpretable pathology biomarkers by estimating biophysical tissue parameters from signal evolutions. However, the pattern-matching algorithms or neural networks used in such inverse problems often lack principled uncertainty quantification, which limits the trustworthiness and transparency required for clinical acceptance. Here, we describe a physics-structured variational autoencoder (PS-VAE) designed for rapid extraction of voxelwise multi-parameter posterior distributions. Our approach integrates a differentiable spin physics simulator with self-supervised learning, and provides a full covariance that captures the inter-parameter correlations of the latent biophysical space. The method was validated in a multi-proton pool chemical exchange saturation transfer (CEST) and semisolid magnetization transfer (MT) molecular MRF study, across in-vitro phantoms, tumor-bearing mice, healthy human volunteers, and a subject with glioblastoma. The resulting multi-parametric posteriors are in good agreement with those calculated using a brute-force Bayesian analysis, while providing an orders-of-magnitude acceleration in whole brain quantification. In addition, we demonstrate how monitoring the multi-parameter posterior dynamics across progressively acquired signals provides practical insights for protocol optimization and may facilitate real-time adaptive acquisition.
- [33] arXiv:2602.03343 [pdf, ps, other]
-
Title: MARADONER: Motif Activity Response Analysis Done Right
Subjects: Computation (stat.CO); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
Inferring the activities of transcription factors from high-throughput transcriptomic or open chromatin profiling, such as RNA-/CAGE-/ATAC-Seq, is a long-standing challenge in systems biology. Identification of highly active master regulators enables mechanistic interpretation of differential gene expression, chromatin state changes, or perturbation responses across conditions, cell types, and diseases. Here, we describe MARADONER, a statistical framework and its software implementation for motif activity response analysis (MARA), utilizing the sequence-level features obtained with pattern matching (motif scanning) of individual promoters and promoter- or gene-level activity or expression estimates. Compared to the classic MARA, MARADONER (MARA-done-right) employs an unbiased variance parameter estimation and a bias-adjusted likelihood estimation of fixed effects, thereby enhancing goodness-of-fit and the accuracy of activity estimation. Further, MARADONER is capable of accounting for heteroscedasticity of motif scores and activity estimates.
- [34] arXiv:2602.03394 [pdf, ps, other]
-
Title: Improving the Linearized Laplace Approximation via Quadratic Approximations
Comments: 6 pages, 1 table. Accepted at European Symposium on Artificial Neural Networks (ESANN 2026) as poster presentation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep neural networks (DNNs) often produce overconfident out-of-distribution predictions, motivating Bayesian uncertainty quantification. The Linearized Laplace Approximation (LLA) achieves this by linearizing the DNN and applying Laplace inference to the resulting model. Importantly, the linear model is also used for prediction. We argue this linearization in the posterior may degrade fidelity to the true Laplace approximation. To alleviate this problem, without increasing significantly the computational cost, we propose the Quadratic Laplace Approximation (QLA). QLA approximates each second order factor in the approximate Laplace log-posterior using a rank-one factor obtained via efficient power iterations. QLA is expected to yield a posterior precision closer to that of the full Laplace without forming the full Hessian, which is typically intractable. For prediction, QLA also uses the linearized model. Empirically, QLA yields modest yet consistent uncertainty estimation improvements over LLA on five regression datasets.
- [35] arXiv:2602.03413 [pdf, ps, other]
-
Title: On the Convergence of Wasserstein Gradient Descent for Sampling
Subjects: Computation (stat.CO)
This paper studies the optimization of the KL functional on the Wasserstein space of probability measures, and develops a sampling framework based on Wasserstein gradient descent (WGD). We identify two important subclasses of the Wasserstein space for which the WGD scheme is guaranteed to converge, thereby providing new theoretical foundations for optimization-based sampling methods on measure spaces. For practical implementation, we construct a particle-based WGD algorithm in which the score function is estimated via score matching. Through a series of numerical experiments, we demonstrate that WGD can provide good approximation to a variety of complex target distributions, including those that pose substantial challenges for standard MCMC and parametric variational Bayes methods. These results suggest that WGD offers a promising and flexible alternative for scalable Bayesian inference in high-dimensional or multimodal settings.
- [36] arXiv:2602.03449 [pdf, ps, other]
-
Title: Score-based diffusion models for diffuse optical tomography with uncertainty quantification
Authors: Fabian Schneider, Meghdoot Mozumder, Konstantin Tamarov, Leila Taghizadeh, Tanja Tarvainen, Tapio Helin, Duc-Lam Duong
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Score-based diffusion models are a recently developed framework for posterior sampling in Bayesian inverse problems with a state-of-the-art performance for severely ill-posed problems by leveraging a powerful prior distribution learned from empirical data. Despite generating significant interest especially in the machine-learning community, a thorough study of realistic inverse problems in the presence of modelling error and utilization of physical measurement data is still outstanding. In this work, the framework of unconditional representation for the conditional score function (UCoS) is evaluated for linearized difference imaging in diffuse optical tomography (DOT). DOT uses boundary measurements of near-infrared light to estimate the spatial distribution of absorption and scattering parameters in biological tissues. The problem is highly ill-posed and thus sensitive to noise and modelling errors. We introduce a novel regularization approach that prevents overfitting of the score function by constructing a mixed score composed of a learned and a model-based component. Validation of this approach is done using both simulated and experimental measurement data. The experiments demonstrate that a data-driven prior distribution results in posterior samples with low variance, compared to classical model-based estimation, and centred around the ground truth, even in the context of a highly ill-posed problem and in the presence of modelling errors.
- [37] arXiv:2602.03483 [pdf, ps, other]
-
Title: Kriging for large datasets via penalized neighbor selection
Comments: Submitted for Journal publication
Subjects: Methodology (stat.ME); Computation (stat.CO)
Kriging is a fundamental tool for spatial prediction, but its computational complexity of $O(N^3)$ becomes prohibitive for large datasets. While local kriging using $K$-nearest neighbors addresses this issue, the selection of $K$ typically relies on ad-hoc criteria that fail to account for spatial correlation structure. We propose a penalized kriging framework that incorporates LASSO-type penalties directly into the kriging equations to achieve automatic, data-driven neighbor selection. We further extend this to adaptive LASSO, using data-driven penalty weights that account for the spatial correlation structure. Our method determines which observations contribute non-zero weights through $\ell_1$ regularization, with the penalty parameter selected via a novel criterion based on effective sample size that balances prediction accuracy against information redundancy. Numerical experiments demonstrate that penalized kriging automatically adapts neighborhood structure to the underlying spatial correlation, selecting fewer neighbors for smoother processes and more for highly variable fields, while maintaining prediction accuracy comparable to global kriging at substantially reduced computational cost.
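A sketch of how an $\ell_1$ penalty can enter the kriging equations is given below for zero-mean simple kriging. This is an illustrative assumption; the paper's exact formulation (for example, an ordinary-kriging unbiasedness constraint) may differ. Here $C$ is the $N\times N$ covariance matrix of the observations $y$, $c_0$ the vector of covariances between the observations and the prediction site $s_0$, $\sigma^2$ the process variance, $\gamma\ge 0$ a penalty parameter, and $w_i$ penalty weights ($w_i=1$ for LASSO, data-driven for adaptive LASSO).

```latex
\hat{y}(s_0) = \lambda^\top y,
\qquad
\mathrm{MSPE}(\lambda) = \sigma^2 - 2\,\lambda^\top c_0 + \lambda^\top C\,\lambda,
\qquad
\lambda_{\mathrm{SK}} = C^{-1} c_0,
\\[4pt]
\hat{\lambda}_{\gamma} \;=\; \arg\min_{\lambda\in\mathbb{R}^N}\;
\Bigl\{\lambda^\top C\,\lambda - 2\,\lambda^\top c_0 + \gamma \sum_{i=1}^{N} w_i\,|\lambda_i|\Bigr\}.
```

Solving for $\lambda_{\mathrm{SK}}$ is the $O(N^3)$ step mentioned in the abstract. In the penalized problem, observations with $\hat{\lambda}_{\gamma,i}=0$ drop out of the predictor, which is what makes the neighbor selection automatic.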
- [38] arXiv:2602.03539 [pdf, ps, other]
-
Title: Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension
Subjects: Statistics Theory (math.ST)
We study approximation and statistical learning properties of deep ReLU networks under structural assumptions that mitigate the curse of dimensionality. We prove minimax-optimal uniform approximation rates for $s$-H\"older smooth functions defined on sets with low Minkowski dimension using fully connected networks with flexible width and depth, improving existing results by logarithmic factors even in classical full-dimensional settings. A key technical ingredient is a new memorization result for deep ReLU networks that enables efficient point fitting with dense architectures. We further introduce a class of compositional models in which each component function is smooth and acts on a domain of low intrinsic dimension. This framework unifies two common assumptions in the statistical learning literature, structural constraints on the target function and low dimensionality of the covariates, within a single model. We show that deep networks can approximate such functions at rates determined by the most difficult function in the composition. As an application, we derive improved convergence rates for empirical risk minimization in nonparametric regression that adapt to smoothness, compositional structure, and intrinsic dimensionality.
- [39] arXiv:2602.03609 [pdf, ps, other]
-
Title: Scalable non-separable spatio-temporal Gaussian process models for large-scale short-term weather prediction
Subjects: Applications (stat.AP)
Monitoring daily weather fields is critical for climate science, agriculture, and environmental planning, yet fully probabilistic spatio-temporal models become computationally prohibitive at continental scale. We present a case study on short-term forecasting of daily maximum temperature and precipitation across the conterminous United States using novel scalable spatio-temporal Gaussian process methodology. Building on three approximation families - inducing-point methods (FITC), Vecchia approximations, and a hybrid Vecchia-inducing-point full-scale approach (VIF) - we introduce three extensions that address key bottlenecks in large space-time settings: (i) a scalable correlation-based neighbor selection strategy for Vecchia approximations with point-referenced data, enabling accurate conditioning under complex dependence structures, (ii) a space-time kMeans++ inducing-point selection algorithm, and (iii) GPU-accelerated implementations of computationally expensive operations, including matrix operations and neighbor searches. Using both synthetic experiments and a large NOAA station dataset containing approximately 1.7 million space-time observations, we analyze the models with respect to predictive performance, parameter estimation, and computational efficiency. Our results demonstrate that scalable Gaussian process models can yield accurate continental-scale forecasts while remaining computationally feasible, offering practical tools for weather applications.
- [40] arXiv:2602.03612 [pdf, ps, other]
-
Title: Generator-based Graph Generation via Heat Diffusion
Comments: Submitted to ICML; 8+15 pages; 20 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Graph generative modelling has become an essential task due to the wide range of applications in chemistry, biology, social networks, and knowledge representation. In this work, we propose a novel framework for generating graphs by adapting the Generator Matching (arXiv:2410.20587) paradigm to graph-structured data. We leverage the graph Laplacian and its associated heat kernel to define a continuous-time diffusion on each graph. The Laplacian serves as the infinitesimal generator of this diffusion, and its heat kernel provides a family of conditional perturbations of the initial graph. A neural network is trained to match this generator by minimising a Bregman divergence between the true generator and a learnable surrogate. Once trained, the surrogate generator is used to simulate a time-reversed diffusion process to sample new graph structures. Our framework unifies and generalises existing diffusion-based graph generative models, injecting domain-specific inductive bias via the Laplacian, while retaining the flexibility of neural approximators. Experimental studies demonstrate that our approach captures structural properties of real and synthetic graphs effectively.
- [41] arXiv:2602.03613 [pdf, ps, other]
-
Title: Simulation-Based Inference via Regression Projection and Batched Discrepancies
Comments: comments are welcome
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
We analyze a lightweight simulation-based inference method that infers simulator parameters using only a regression-based projection of the observed data. After fitting a surrogate linear regression once, the procedure simulates small batches at the proposed parameter values and assigns kernel weights based on the resulting batch-residual discrepancy, producing a self-normalized pseudo-posterior that is simple, parallelizable, and requires access only to the fitted regression coefficients rather than raw observations. We formalize the construction as an importance-sampling approximation to a population target that averages over simulator randomness, prove consistency as the number of parameter draws grows, and establish stability in estimating the surrogate regression from finite samples. We then characterize the asymptotic concentration as the batch size increases and the bandwidth shrinks, showing that the pseudo-posterior concentrates on an identified set determined by the chosen projection, thereby clarifying when the method yields point versus set identification. Experiments on a tractable nonlinear model and on a cosmological calibration task using the DREAMS simulation suite illustrate the computational advantages of regression-based projections and the identifiability limitations arising from low-information summaries.
- [42] arXiv:2602.03682 [pdf, ps, other]
-
Title: Improved Analysis of the Accelerated Noisy Power Method with Applications to Decentralized PCA
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
We analyze the Accelerated Noisy Power Method, an algorithm for Principal Component Analysis in the setting where only inexact matrix-vector products are available, which can arise for instance in decentralized PCA. While previous works have established that acceleration can improve convergence rates compared to the standard Noisy Power Method, these guarantees require overly restrictive upper bounds on the magnitude of the perturbations, limiting their practical applicability. We provide an improved analysis of this algorithm, which preserves the accelerated convergence rate under much milder conditions on the perturbations. We show that our new analysis is worst-case optimal, in the sense that the convergence rate cannot be improved, and that the noise conditions we derive cannot be relaxed without sacrificing convergence guarantees. We demonstrate the practical relevance of our results by deriving an accelerated algorithm for decentralized PCA, which has similar communication costs to non-accelerated methods. To our knowledge, this is the first decentralized algorithm for PCA with provably accelerated convergence.
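The setting of inexact matrix-vector products can be illustrated with a small power iteration in which Gaussian noise is added to every product. With `beta = 0` this is a plain noisy power method. With `beta > 0` it adds a momentum term of the heavy-ball kind, which is one common way to accelerate power iteration and is not necessarily the exact scheme analyzed in the paper. The function names, the noise model, and the parameter values are assumptions for illustration.

```python
import math
import random


def matvec(a, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def noisy_power_method(a, n_iters, noise_std, beta=0.0, seed=0):
    """Power iteration with noisy products and optional momentum.

    Each step forms A x - beta * x_prev plus Gaussian noise, then rescales
    both iterates by the same factor so the recurrence stays consistent.
    """
    rng = random.Random(seed)
    d = len(a)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = norm(x)
    x = [v / scale for v in x]
    x_prev = [0.0] * d
    for _ in range(n_iters):
        ax = matvec(a, x)
        # Inexact product: exact A x perturbed by i.i.d. Gaussian noise.
        u = [ax[i] - beta * x_prev[i] + rng.gauss(0.0, noise_std) for i in range(d)]
        scale = norm(u)
        x_prev = [v / scale for v in x]
        x = [v / scale for v in u]
    return x


A = [[4.0, 1.0], [1.0, 3.0]]  # hypothetical symmetric test matrix
v_plain = noisy_power_method(A, 100, 1e-3)
v_momentum = noisy_power_method(A, 100, 1e-3, beta=1.4)
```

For this matrix the leading eigenvector is proportional to $(1, (\sqrt{5}-1)/2)$, and both runs should align with it up to sign. In this sketch the momentum coefficient is kept below $\lambda_1^2/4$, where $\lambda_1$ is the largest eigenvalue.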
- [43] arXiv:2602.03730 [pdf, ps, other]
-
Title: Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators
Authors: Luke Solo, Matthew B.A. McDermott, William F. Parker, Bashar Ramadan, Michael C. Burkhart, Brett K. Beaulieu-Jones
Comments: 10 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation for future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.
- [44] arXiv:2602.03756 [pdf, ps, other]
-
Title: Bayesian variable and hazard structure selection in the General Hazard model
Subjects: Methodology (stat.ME)
The proportional hazards (PH) and accelerated failure time (AFT) models are the most widely used hazard structures for analysing time-to-event data. When the goal is to identify variables associated with event times, variable selection is typically performed within a single hazard structure, imposing strong assumptions on how covariates affect the hazard function. To allow simultaneous selection of relevant variables and the hazard structure itself, we develop a Bayesian variable selection approach within the general hazard (GH) model, which includes the PH, AFT, and other structures as special cases. We propose two types of g-priors for the regression coefficients that enable tractable computation and show that both lead to consistent model selection. We also introduce a hierarchical prior on the model space that accounts for multiplicity and penalises model complexity. To efficiently explore the GH model space, we extend the Add-Delete-Swap algorithm to jointly sample variable inclusion indicators and hazard structures. Simulation studies show accurate recovery of both the true hazard structure and active variables across different sample sizes and censoring levels. Two real-data applications are presented to illustrate the use of the proposed methodology and to compare it with existing variable selection methods.
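For readers unfamiliar with the GH model, one common way of writing it is recalled below. The notation is assumed here and the paper may use different covariate sets for the two components: $h_0$ is a baseline hazard, $x$ the covariate vector, and $\alpha$, $\beta$ regression coefficients acting on the time scale and the hazard scale respectively.

```latex
h(t \mid x) \;=\; h_0\!\bigl(t\,e^{x^\top \alpha}\bigr)\,e^{x^\top \beta},
\qquad
\begin{cases}
\alpha = 0 & \text{proportional hazards (PH)},\\
\alpha = \beta & \text{accelerated failure time (AFT)},\\
\beta = 0 & \text{accelerated hazards (AH)}.
\end{cases}
```

Selecting the hazard structure then amounts to deciding which of these restrictions, if any, the data support, alongside which entries of $\alpha$ and $\beta$ are nonzero.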
- [45] arXiv:2602.03789 [pdf, ps, other]
-
Title: Fast Sampling for Flows and Diffusions with Lazy and Point Mass Stochastic Interpolants
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Stochastic interpolants unify flows and diffusions, popular generative modeling frameworks. A primary hyperparameter in these methods is the interpolation schedule that determines how to bridge a standard Gaussian base measure to an arbitrary target measure. We prove how to convert a sample path of a stochastic differential equation (SDE) with arbitrary diffusion coefficient under any schedule into the unique sample path under another arbitrary schedule and diffusion coefficient. We then extend the stochastic interpolant framework to admit a larger class of point mass schedules in which the Gaussian base measure collapses to a point mass measure. Under the assumption of Gaussian data, we identify lazy schedule families that make the drift identically zero and show that with deterministic sampling one gets a variance-preserving schedule commonly used in diffusion models, whereas with statistically optimal SDE sampling one gets our point mass schedule. Finally, to demonstrate the usefulness of our theoretical results on realistic highly non-Gaussian data, we apply our lazy schedule conversion to a state-of-the-art pretrained flow model and show that this allows for generating images in fewer steps without retraining the model.
- [46] arXiv:2602.03823 [pdf, ps, other]
-
Title: Preference-based Conditional Treatment Effects and Policy Learning
Comments: Accepted to AISTATS 2026; 10 pages + appendix
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We introduce a new preference-based framework for conditional treatment effect estimation and policy learning, built on the Conditional Preference-based Treatment Effect (CPTE). CPTE requires only that outcomes be ranked under a preference rule, unlocking flexible modeling of heterogeneous effects with multivariate, ordinal, or preference-driven outcomes. This unifies applications such as conditional probability of necessity and sufficiency, conditional Win Ratio, and Generalized Pairwise Comparisons. Despite the intrinsic non-identifiability of comparison-based estimands, CPTE provides interpretable targets and delivers new identifiability conditions for previously unidentifiable estimands. We present estimation strategies via matching, quantile, and distributional regression, and further design efficient influence-function estimators to correct plug-in bias and maximize policy value. Synthetic and semi-synthetic experiments demonstrate clear performance gains and practical impact.
Cross-lists for Wed, 4 Feb 26
- [47] arXiv:2602.02583 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Copula-Based Aggregation and Context-Aware Conformal Prediction for Reliable Renewable Energy Forecasting
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
The rapid growth of renewable energy penetration has intensified the need for reliable probabilistic forecasts to support grid operations at aggregated (fleet or system) levels. In practice, however, system operators often lack access to fleet-level probabilistic models and instead rely on site-level forecasts produced by heterogeneous third-party providers. Constructing coherent and calibrated fleet-level probabilistic forecasts from such inputs remains challenging due to complex cross-site dependencies and aggregation-induced miscalibration. This paper proposes a calibrated probabilistic aggregation framework that directly converts site-level probabilistic forecasts into reliable fleet-level forecasts in settings where system-level models cannot be trained or maintained. The framework integrates copula-based dependence modeling to capture cross-site correlations with Context-Aware Conformal Prediction (CACP) to correct miscalibration at the aggregated level. This combination enables dependence-aware aggregation while providing valid coverage and maintaining sharp prediction intervals. Experiments on large-scale solar generation datasets from MISO, ERCOT, and SPP demonstrate that the proposed Copula+CACP approach consistently achieves near-nominal coverage with significantly sharper intervals than uncalibrated aggregation baselines.
- [48] arXiv:2602.02596 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Fubini Study geometry of representation drift in high dimensional data
Authors: Arturo Tozzi
Comments: 8 pages, 1 figure
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
High dimensional representation drift is commonly quantified using Euclidean or cosine distances, which presuppose fixed coordinates when comparing representations across time, training or preprocessing stages. While effective in many settings, these measures entangle intrinsic changes in the data with variations induced by arbitrary parametrizations. We introduce a projective geometric view of representation drift grounded in the Fubini Study metric, which identifies representations that differ only by gauge transformations such as global rescalings or sign flips. Applying this framework to empirical high dimensional datasets, we explicitly construct representation trajectories and track their evolution through cumulative geometric drift. Comparing Euclidean, cosine and Fubini Study distances along these trajectories reveals that conventional metrics systematically overestimate change whenever representations carry genuine projective ambiguity. By contrast, the Fubini Study metric isolates intrinsic evolution by remaining invariant under gauge-induced fluctuations. We further show that the difference between cosine and Fubini Study drift defines a computable, monotone quantity that directly captures representation churn attributable to gauge freedom. This separation provides a diagnostic for distinguishing meaningful structural evolution from parametrization artifacts, without introducing model-specific assumptions. Overall, we establish a geometric criterion for assessing representation stability in high-dimensional systems and clarify the limits of angular distances. Embedding representation dynamics in projective space connects data analysis with established geometric programs and yields observables that are directly testable in empirical workflows.
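The invariance described in this abstract can be illustrated with a minimal sketch. It assumes real-valued representation vectors and the standard projective form of the Fubini-Study distance, arccos(|<u, v>| / (||u|| ||v||)); the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def euclidean_distance(u, v):
    """Plain Euclidean distance; sensitive to rescaling and sign flips."""
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    """1 - cosine similarity; scale-invariant but not sign-invariant."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(1.0 - cos)

def fubini_study_distance(u, v):
    """Angle between the lines spanned by u and v (real projective space).

    Taking the absolute value of the normalized inner product makes the
    result invariant to global rescaling and sign flips of either vector.
    """
    overlap = abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(overlap, 0.0, 1.0)))

# A representation and a "gauge-transformed" copy: rescaled and sign-flipped.
u = np.array([1.0, 2.0, 3.0])
v = -2.5 * u

print("Euclidean:   ", euclidean_distance(u, v))
print("Cosine:      ", cosine_distance(u, v))
print("Fubini-Study:", fubini_study_distance(u, v))
```

In this toy case the Euclidean and cosine distances both register a large change, while the Fubini-Study distance is zero up to floating-point error, because the two vectors span the same line.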
- [49] arXiv:2602.02626 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Learning Better Certified Models from Empirically-Robust Teachers
Authors: Alessandro De Palma
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes directly train on bounds from network relaxations to obtain models that are certifiably robust, but display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, differently from empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state-of-the-art in certified training for ReLU networks across a series of robust computer vision benchmarks.
- [50] arXiv:2602.02706 (cross-list from physics.space-ph) [pdf, ps, other]
-
Title: Ionospheric Observations from the ISS: Overcoming Noise Challenges in Signal Extraction
Authors: Rachel Ulrich, Kelly R. Moran, Ky Potter, Lauren A. Castro, Gabriel R. Wilson, Brian Weaver, Carlos Maldonado
Subjects: Space Physics (physics.space-ph); Applications (stat.AP)
The Electric Propulsion Electrostatic Analyzer Experiment (ÈPÈE) is a compact ion energy bandpass filter deployed on the International Space Station (ISS) in March 2023 and providing continuous measurements through April 2024. This period coincides with the Solar Cycle 25 maximum, capturing unique observations of solar activity extremes in the mid- to low-latitude regions of the topside ionosphere. From these in situ spectra we derive plasma parameters that inform space-weather impacts on satellite navigation and radio communication. We present a statistical processing pipeline for ÈPÈE that (i) estimates the instrument noise floor, (ii) accounts for irregular temporal sampling, and (iii) extracts ionospheric signals. Rather than discarding noisy data, the method learns a baseline noise model and fits the measurement surface using a scaled Vecchia Gaussian process approximation, recovering values typically rejected by thresholding. The resulting products increase data coverage and enable noise-assisted monitoring of ionospheric variability.
- [51] arXiv:2602.02819 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Membership Inference Attacks from Causal Principles
Authors: Mathieu Even, Clément Berenfeld, Linus Bleistein, Tudor Cebere, Julie Josse, Aurélien Bellet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Membership Inference Attacks (MIAs) are widely used to quantify training data memorization and assess privacy risks. Standard evaluation requires repeated retraining, which is computationally costly for large models. One-run methods (single training with randomized data inclusion) and zero-run methods (post hoc evaluation) are often used instead, though their statistical validity remains unclear. To address this gap, we frame MIA evaluation as a causal inference problem, defining memorization as the causal effect of including a data point in the training set. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations popular for LLMs are confounded by non-random membership assignment. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. Experiments on real-world data show that our approach enables reliable memorization measurement even when retraining is impractical and under distribution shift, providing a principled foundation for privacy evaluation in modern AI systems.
- [52] arXiv:2602.02830 (cross-list from cs.LG) [pdf, ps, other]
-
Title: SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs
Comments: 8 pages
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Discovering causal structures from multivariate time series is a key problem because interactions span across multiple lags and possibly involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose Stable Causal Dynamic Differentiable Discovery (SC3D), a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity along with enforcing acyclicity on the instantaneous block. Numerical results across synthetic and benchmark dynamical systems demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing temporal baselines.
- [53] arXiv:2602.02855 (cross-list from cs.LG) [pdf, ps, other]
-
Title: When pre-training hurts LoRA fine-tuning: a dynamical analysis via single-index models
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistics Theory (math.ST)
Pre-training on a source task is usually expected to facilitate fine-tuning on similar downstream problems. In this work, we mathematically show that this naive intuition is not always true: excessive pre-training can computationally slow down fine-tuning optimization. We study this phenomenon for low-rank adaptation (LoRA) fine-tuning on single-index models trained under one-pass SGD. Leveraging a summary statistics description of the fine-tuning dynamics, we precisely characterize how the convergence rate depends on the initial fine-tuning alignment and the degree of non-linearity of the target task. The key takeaway is that even when the pre-training and downstream tasks are well aligned, strong pre-training can induce a prolonged search phase and hinder convergence. Our theory thus provides a unified picture of how pre-training strength and task difficulty jointly shape the dynamics and limitations of LoRA fine-tuning in a nontrivial tractable model.
- [54] arXiv:2602.02908 (cross-list from cs.LG) [pdf, ps, other]
-
Title: A Random Matrix Theory Perspective on the Consistency of Diffusion Models
Comments: 65 pages; 53 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation $\sigma^2 \mapsto \kappa(\sigma^2)$, explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: anisotropy across eigenmodes, inhomogeneity across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviate across training data splits. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.
- [55] arXiv:2602.02912 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Notes on the Reward Representation of Posterior Updates
Authors: Pedro A. Ortega
Comments: Technical report, 9 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighing of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring one reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
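The identification point in this abstract can be made concrete with a small sketch. It assumes the textbook form of a KL-regularized soft update over a finite action set, where the updated distribution is proportional to the baseline times exp(beta * reward); the function name, the toy numbers, and the choice beta = 1 are illustrative assumptions rather than the report's own notation.

```python
import numpy as np

def soft_update(prior, rewards, beta=1.0):
    """KL-regularized soft update: posterior proportional to prior * exp(beta * r)."""
    logits = np.log(prior) + beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()          # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

prior = np.array([0.5, 0.3, 0.2])       # baseline distribution (toy values)
rewards = np.array([1.0, 0.0, 2.0])     # one reward description (toy values)

posterior = soft_update(prior, rewards)
shifted = soft_update(prior, rewards + 7.0)   # same rewards plus a constant baseline

# The update only pins down rewards up to an additive constant:
# log(posterior / prior) = beta * r - log Z, so differences of rewards are
# recoverable while the absolute level is not.
implied = np.log(posterior / prior)

print(posterior)
print(shifted)
print(implied - implied[1], rewards - rewards[1])
```

Both reward vectors produce the same updated distribution, and the log-ratio of posterior to prior recovers reward differences only, which is one simple reading of rewards being ambiguous up to a baseline.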
- [56] arXiv:2602.02986 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Why Some Models Resist Unlearning: A Linear Stability Perspective
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Machine unlearning, the ability to erase the effect of specific training samples without retraining from scratch, is critical for privacy, regulation, and efficiency. However, most progress in unlearning has been empirical, with little theoretical understanding of when and why unlearning works. We tackle this gap by framing unlearning through the lens of asymptotic linear stability to capture the interaction between optimization dynamics and data geometry. The key quantity in our analysis is data coherence, which is the cross-sample alignment of loss surface directions near the optimum. We decompose coherence along three axes: within the retain set, within the forget set, and between them, and prove tight stability thresholds that separate convergence from divergence. To further link data properties to forgettability, we study a two-layer ReLU CNN under a signal-plus-noise model and show that stronger memorization makes forgetting easier: when the signal-to-noise ratio (SNR) is lower, cross-sample alignment is weaker, reducing coherence and making unlearning easier; conversely, high-SNR, highly aligned models resist unlearning. For empirical verification, we show that Hessian tests and CNN heatmaps align closely with the predicted boundary, mapping the stability frontier of gradient-based unlearning as a function of batching, mixing, and data/model alignment. Our analysis is grounded in random matrix theory tools and provides the first principled account of the trade-offs between memorization, coherence, and unlearning.
- [57] arXiv:2602.03055 (cross-list from eess.SP) [pdf, ps, other]
-
Title: Stationarity and Spectral Characterization of Random Signals on Simplicial Complexes
Subjects: Signal Processing (eess.SP); Machine Learning (stat.ML)
It is increasingly common for data to possess intricate structure, necessitating new models and analytical tools. Graphs, a prominent type of structure, can encode the relationships between any two entities (nodes). However, graphs neither allow connections that are not dyadic nor permit relationships between sets of nodes. We thus turn to simplicial complexes for connecting more than two nodes as well as modeling relationships between simplices, such as edges and triangles. Our data then consist of signals lying on topological spaces, represented by simplicial complexes. Much recent work explores these topological signals, albeit primarily through deterministic formulations. We propose a probabilistic framework for random signals defined on simplicial complexes. Specifically, we generalize the classical notion of stationarity. By spectral dualities of Hodge and Dirac theory, we define stationary topological signals as the outputs of topological filters given white noise. This definition naturally extends desirable properties of stationarity that hold for both time-series and graph signals. Crucially, we properly define topological power spectral density (PSD) through a clear spectral characterization. We then discuss the advantages of topological stationarity due to spectral properties via the PSD. In addition, we empirically demonstrate the practicality of these benefits through multiple synthetic and real-world simulations.
- [58] arXiv:2602.03061 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is unstable.
- [59] arXiv:2602.03143 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Self-Hinting Language Models Enhance Reinforcement Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
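The collapse mechanism mentioned in this abstract can be seen in a few lines. The sketch below assumes the commonly used GRPO normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation; the function name and the `eps` stabilizer are illustrative assumptions and not code from the SAGE repository.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps) within one group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Sparse terminal reward: every rollout in the group fails the verifier.
all_fail = group_relative_advantages([0, 0, 0, 0])

# Same group size, but one rollout succeeds (more within-group diversity).
mixed = group_relative_advantages([0, 1, 0, 0])

print(all_fail)  # all zeros: no learning signal for this prompt
print(mixed)     # nonzero: the successful rollout gets a positive advantage
```

When all rewards in a group are identical, every advantage is zero and the policy gradient for that prompt vanishes; any change to the rollout distribution that yields mixed outcomes restores a signal without altering the reward itself.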
- [60] arXiv:2602.03325 (cross-list from q-fin.PM) [pdf, ps, other]
-
Title: A Novel approach to portfolio construction
Subjects: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Risk Management (q-fin.RM); Machine Learning (stat.ML)
This paper proposes a machine learning-based framework for asset selection and portfolio construction, termed the Best-Path Algorithm Sparse Graphical Model (BPASGM). The method extends the Best-Path Algorithm (BPA) by mapping linear and non-linear dependencies among a large set of financial assets into a sparse graphical model satisfying a structural Markov property. Based on this representation, BPASGM performs a dependence-driven screening that removes positively or redundantly connected assets, isolating subsets that are conditionally independent or negatively correlated. This step is designed to enhance diversification and reduce estimation error in high-dimensional portfolio settings. Portfolio optimization is then conducted on the selected subset using standard mean-variance techniques. BPASGM does not aim to improve the theoretical mean-variance optimum under known population parameters, but rather to enhance realized performance in finite samples, where sample-based Markowitz portfolios are highly sensitive to estimation error. Monte Carlo simulations show that BPASGM-based portfolios achieve more stable risk-return profiles, lower realized volatility, and superior risk-adjusted performance compared to standard mean-variance portfolios. Empirical results for U.S. equities, global stock indices, and foreign exchange rates over 1990-2025 confirm these findings and demonstrate a substantial reduction in portfolio cardinality. Overall, BPASGM offers a statistically grounded and computationally efficient framework that integrates sparse graphical modeling with portfolio theory for dependence-aware asset selection.
- [61] arXiv:2602.03459 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Causal Inference on Networks under Misspecified Exposure Mappings: A Partial Identification Framework
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Estimating treatment effects in networks is challenging, as each potential outcome depends on the treatments of all other nodes in the network. To overcome this difficulty, existing methods typically impose an exposure mapping that compresses the treatment assignments in the network into a low-dimensional summary. However, if this mapping is misspecified, standard estimators for direct and spillover effects can be severely biased. We propose a novel partial identification framework for causal inference on networks to assess the robustness of treatment effects under misspecifications of the exposure mapping. Specifically, we derive sharp upper and lower bounds on direct and spillover effects under such misspecifications. As such, our framework presents a novel application of causal sensitivity analysis to exposure mappings. We instantiate our framework for three canonical exposure settings widely used in practice: (i) weighted means of the neighborhood treatments, (ii) threshold-based exposure mappings, and (iii) truncated neighborhood interference in the presence of higher-order spillovers. Furthermore, we develop orthogonal estimators for these bounds and prove that the resulting bound estimates are valid, sharp, and efficient. Our experiments show the bounds remain informative and provide reliable conclusions under misspecification of exposure mappings.
- [62] arXiv:2602.03461 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Soft-Radial Projection for Constrained End-to-End Learning
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
Integrating hard constraints into deep learning is essential for safety-critical systems. Yet existing constructive layers that project predictions onto constraint boundaries face a fundamental bottleneck: gradient saturation. By collapsing exterior points onto lower-dimensional surfaces, standard orthogonal projections induce rank-deficient Jacobians, which nullify gradients orthogonal to active constraints and hinder optimization. We introduce Soft-Radial Projection, a differentiable reparameterization layer that circumvents this issue through a radial mapping from Euclidean space into the interior of the feasible set. This construction guarantees strict feasibility while preserving a full-rank Jacobian almost everywhere, thereby preventing the optimization stalls typical of boundary-based methods. We theoretically prove that the architecture retains the universal approximation property and empirically show improved convergence behavior and solution quality over state-of-the-art optimization- and projection-based baselines.
- [63] arXiv:2602.03466 (cross-list from quant-ph) [pdf, ps, other]
-
Title: Quantum Circuit Generation via test-time learning with large language models
Authors: Adriano Macarone-Palmieri
Comments: 9 pages, 1 figure
Subjects: Quantum Physics (quant-ph); Machine Learning (stat.ML)
Large language models (LLMs) can generate structured artifacts, but using them as dependable optimizers for scientific design requires a mechanism for iterative improvement under black-box evaluation. Here, we cast quantum circuit synthesis as a closed-loop, test-time optimization problem: an LLM proposes edits to a fixed-length gate list, and an external simulator evaluates the resulting state with the Meyer-Wallach (MW) global entanglement measure. We introduce a lightweight test-time learning recipe that can reuse prior high-performing candidates as an explicit memory trace, augment prompts with score-difference feedback, and apply restart-from-the-best sampling to escape potential plateaus. Across fixed 20-qubit settings, the loop without feedback and restart-from-the-best improves random initial circuits over a range of gate budgets. To raise this performance and the success rate, we use the full learning strategy. For 25 qubits, it mitigates a pronounced performance plateau that appears when naive querying is used. Beyond raw scores, we analyze the structure of synthesized states and find that high-MW solutions can correspond to stabilizer or graph-state-like constructions, but full connectivity is not guaranteed due to the metric property and prompt design. These results illustrate both the promise and the pitfalls of memory evaluator-guided LLM optimization for circuit synthesis, highlighting the critical role of prior human-derived theoretical results in optimally designing a custom tool in support of research.
- [64] arXiv:2602.03514 (cross-list from cs.LG) [pdf, ps, other]
-
Title: A Function-Space Stability Boundary for Generalization in Interpolating Learning Systems
Authors: Ronald Katende
Comments: 10 pages, 8 figures
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Modern learning systems often interpolate training data while still generalizing well, yet it remains unclear when algorithmic stability explains this behavior. We model training as a function-space trajectory and measure sensitivity to single-sample perturbations along this trajectory. We propose a contractive propagation condition and a stability certificate obtained by unrolling the resulting recursion. A small certificate implies stability-based generalization, while we also prove that there exist interpolating regimes with small risk where such contractive sensitivity cannot hold, showing that stability is not a universal explanation. Experiments confirm that certificate growth predicts generalization differences across optimizers, step sizes, and dataset perturbations. The framework therefore identifies regimes where stability explains generalization and where alternative mechanisms must account for success.
- [65] arXiv:2602.03566 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Riemannian Neural Optimal Transport
Comments: 58 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Computational optimal transport (OT) offers a principled framework for generative modeling. Neural OT methods, which use neural networks to learn an OT map (or potential) from data in an amortized way, can be evaluated out of sample after training, but existing approaches are tailored to Euclidean geometry. Extending neural OT to high-dimensional Riemannian manifolds remains an open challenge. In this paper, we prove that any method for OT on manifolds that produces discrete approximations of transport maps necessarily suffers from the curse of dimensionality: achieving a fixed accuracy requires a number of parameters that grows exponentially with the manifold dimension. Motivated by this limitation, we introduce Riemannian Neural OT (RNOT) maps, which are continuous neural-network parameterizations of OT maps on manifolds that avoid discretization and incorporate geometric structure by construction. Under mild regularity assumptions, we prove that RNOT maps approximate Riemannian OT maps with sub-exponential complexity in the dimension. Experiments on synthetic and real datasets demonstrate improved scalability and competitive performance relative to discretization-based baselines.
- [66] arXiv:2602.03685 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Universal One-third Time Scaling in Learning Peaked Distributions
Comments: 24 pages, 6 main text figures, 27 figures in total
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
- [67] arXiv:2602.03702 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
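The two ingredients named in the abstract above, a horizon-free $1/\sqrt{t}$ step size and weight averaging, can be illustrated on a toy least-squares problem. The following is a minimal sketch, not the authors' recipe: the function names `inv_sqrt_lr` and `sgd_with_averaging` are hypothetical, the uniform running average is an assumed form of weight averaging (the abstract does not specify one), and the low-dimensional problem only stands in for the overparameterized setting analyzed in the paper.

```python
import numpy as np


def inv_sqrt_lr(step, base_lr=0.1):
    """Horizon-free step size base_lr / sqrt(step); step counts from 1.

    The schedule never needs the total number of training steps.
    """
    return base_lr / np.sqrt(step)


def sgd_with_averaging(X, y, steps=2000, base_lr=0.1, seed=0):
    """Single-sample SGD on squared error with a running average of iterates.

    Returns (last iterate, uniform average of all iterates). The uniform
    average is an assumption made for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)                 # draw one training example
        grad = (X[i] @ w - y[i]) * X[i]     # gradient of 0.5 * (x.w - y)^2
        w = w - inv_sqrt_lr(t, base_lr) * grad
        w_avg += (w - w_avg) / t            # incremental mean of w_1..w_t
    return w, w_avg
```

Because neither the step size nor the average depends on `steps`, the loop can be stopped at any iteration and `w_avg` used as the current model, which is the sense in which such a schedule is "anytime".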
- [68] arXiv:2602.03740 (cross-list from math.PR) [pdf, ps, other]
-
Title: On the compatibility between the spatial moments and the codomain of a real random field
Subjects: Probability (math.PR); Statistics Theory (math.ST)
While any symmetric and positive semidefinite mapping can be the non-centered covariance of a Gaussian random field, it is known that these conditions are no longer sufficient when the random field is valued in a two-point set. The question therefore arises of which conditions are necessary and sufficient for a mapping $\rho: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ to be the non-centered covariance of a random field with values in a subset $\mathcal{E}$ of $\mathbb{R}$. Such conditions are presented in the general case when $\mathcal{E}$ is a closed subset of the real line, then examined for some specific cases. In particular, if $\mathcal{E}=\mathbb{R}$ or $\mathbb{Z}$, it is shown that the conditions reduce to $\rho$ being symmetric and positive semidefinite. If $\mathcal{E}$ is a closed interval or a two-point set, the necessary and sufficient conditions are more restrictive: the symmetry, positive semidefiniteness, upper and lower boundedness of $\rho$ are no longer enough to guarantee the existence of a random field valued in $\mathcal{E}$ and having $\rho$ as its non-centered covariance. Similar characterizations are obtained for semivariograms and higher-order spatial moments, as well as for multivariate random fields.
Replacements for Wed, 4 Feb 26
- [69] arXiv:2301.07473 (replaced) [pdf, ps, other]
-
Title: Discrete Latent Structure in Neural Networks
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- [70] arXiv:2312.07397 (replaced) [pdf, ps, other]
-
Title: Neural Entropic Optimal Transport and Gromov-Wasserstein Alignment
Subjects: Statistics Theory (math.ST)
- [71] arXiv:2404.02070 (replaced) [pdf, ps, other]
-
Title: Asymptotics of resampling without replacement in robust and logistic regression
Comments: 27 pages, 8 figures
Subjects: Statistics Theory (math.ST)
- [72] arXiv:2407.03094 (replaced) [pdf, ps, other]
-
Title: Conformal Prediction for Causal Effects of Continuous Treatments
Authors: Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Heß, Valentyn Melnychuk, Stefan Feuerriegel
Comments: Accepted at NeurIPS 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
- [73] arXiv:2407.11937 (replaced) [pdf, ps, other]
- [74] arXiv:2408.14940 (replaced) [pdf, ps, other]
-
Title: Bayesian spatiotemporal modelling of political violence and conflict events using discrete-time Hawkes processes
Subjects: Applications (stat.AP)
- [75] arXiv:2410.03619 (replaced) [pdf, ps, other]
-
Title: Functional-SVD for Heterogeneous Trajectories: Case Studies in Health
Comments: Journal of the American Statistical Association, to appear
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP); Computation (stat.CO)
- [76] arXiv:2410.23222 (replaced) [pdf, ps, other]
-
Title: Dataset-Driven Channel Masks in Transformers for Multivariate Time Series
Comments: ICASSP 2026. Preliminary version: NeurIPS Workshop on Time Series in the Age of Large Models 2024 (Oral presentation)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- [77] arXiv:2411.06501 (replaced) [pdf, ps, other]
-
Title: Individual Regret in Cooperative Stochastic Multi-Armed Bandits
Comments: 55 pages, 1 figure
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- [78] arXiv:2411.14349 (replaced) [pdf, ps, other]
-
Title: Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
- [79] arXiv:2501.00382 (replaced) [pdf, ps, other]
-
Title: Adventures in Demand Analysis Using AI
Authors: Philipp Bach, Victor Chernozhukov, Sven Klaassen, Martin Spindler, Jan Teichert-Kluge, Suhas Vijaykumar
Comments: 35 pages, 8 figures
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
- [80] arXiv:2503.00014 (replaced) [pdf, ps, other]
-
Title: LSD of the Commutator of two data Matrices
Comments: arXiv admin note: substantial text overlap with arXiv:2409.16780
Subjects: Statistics Theory (math.ST); Probability (math.PR)
- [81] arXiv:2503.19859 (replaced) [pdf, ps, other]
-
Title: An Overview of Low-Rank Structures in the Training and Adaptation of Large Models
Authors: Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, Can Yaras
Comments: Authors are listed alphabetically; 37 pages, 15 figures; minor revision at IEEE Signal Processing Magazine
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
- [82] arXiv:2503.22366 (replaced) [pdf, ps, other]
-
Title: Conditional Extreme Value Estimation for Dependent Time Series
Journal-ref: Bladt, M., Glargaard, L. & Henningsen, T. Conditional extreme value estimation for dependent time series. Extremes (2026)
Subjects: Statistics Theory (math.ST)
- [83] arXiv:2504.06799 (replaced) [pdf, ps, other]
-
Title: Compatibility of Missing Data Handling Methods across the Stages of Producing Clinical Prediction Models
Authors: Antonia Tsvetanova, Matthew Sperrin, David A. Jenkins, Niels Peek, Iain Buchan, Stephanie Hyland, Marcus Taylor, Angela Wood, Richard D. Riley, Glen P. Martin
Comments: 40 pages, 6 figures (6 supplementary figures)
Subjects: Methodology (stat.ME)
- [84] arXiv:2505.06927 (replaced) [pdf, ps, other]
-
Title: Stability Regularized Cross-Validation
Comments: Some of this material previously appeared in 2306.14851v2, which we have split into two papers (this one and 2306.14851v3), because it contained two ideas that need separate papers
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
- [85] arXiv:2505.08395 (replaced) [pdf, ps, other]
-
Title: Bayesian Estimation of Causal Effects Using Proxies of a Latent Interference Network
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML); Other Statistics (stat.OT)
- [86] arXiv:2505.12387 (replaced) [pdf, ps, other]
-
Title: Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning
Comments: Published at NeurIPS 2025
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Mathematical Physics (math-ph); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
- [87] arXiv:2505.15543 (replaced) [pdf, ps, other]
-
Title: Heavy-tailed and Horseshoe priors for regression and sparse Besov rates
Comments: 36 pages, 6 figures
Subjects: Statistics Theory (math.ST)
- [88] arXiv:2505.16644 (replaced) [pdf, ps, other]
-
Title: Learning non-equilibrium diffusions with Schrödinger bridges: from exactly solvable to simulation-free
Comments: 10 pages, 5 figures, NeurIPS 2025
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
- [89] arXiv:2505.17961 (replaced) [pdf, ps, other]
-
Title: Federated Causal Inference from Multi-Site Observational Data via Propensity Score Aggregation
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Applications (stat.AP)
- [90] arXiv:2505.23506 (replaced) [pdf, ps, other]
-
Title: Position: Epistemic uncertainty estimation methods are fundamentally incomplete
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- [91] arXiv:2506.07096 (replaced) [pdf, ps, other]
-
Title: Efficient and Robust Block Designs for Order-of-Addition Experiments
Authors: Chang-Yun Lin
Subjects: Methodology (stat.ME); Applications (stat.AP)
- [92] arXiv:2507.08261 (replaced) [pdf, ps, other]
-
Title: Admissibility of Stein Shrinkage for Batch Normalization in the Presence of Adversarial Attacks
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [93] arXiv:2507.18554 (replaced) [pdf, ps, other]
-
Title: How weak are weak factors? Uniform inference for signal strength in signal plus noise models
Comments: 76 pages, 6 figures. v2: extended discussion and additional references
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Probability (math.PR); Statistics Theory (math.ST)
- [94] arXiv:2507.22218 (replaced) [pdf, ps, other]
-
Title: Attenuation Bias with Latent Predictors
Comments: 37 pages
Subjects: Applications (stat.AP)
- [95] arXiv:2508.11847 (replaced) [pdf, ps, other]
-
Title: Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [96] arXiv:2509.17382 (replaced) [src]
-
Title: Bias-variance Tradeoff in Tensor Estimation
Authors: Shivam Kumar, Haotian Xu, Carlos Misael Madrid Padilla, Yuehaw Khoo, Oscar Hernan Madrid Padilla, Daren Wang
Comments: We are withdrawing the paper in order to update it with more consistent results and improved presentation. We plan to strengthen the analysis and ensure that the results are aligned more clearly throughout the manuscript
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
- [97] arXiv:2509.26096 (replaced) [pdf, ps, other]
-
Title: EVODiff: Entropy-aware Variance Optimized Diffusion Inference
Comments: NeurIPS 2025, 41 pages, 14 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [98] arXiv:2510.10000 (replaced) [pdf, ps, other]
-
Title: Tight Robustness Certificates and Wasserstein Distributional Attacks for Deep Neural Networks
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [99] arXiv:2510.16798 (replaced) [pdf, ps, other]
-
Title: Causal inference for calibrated scaling interventions on time-to-event processes
Comments: Added a simulation study; manuscript shortened and reorganized in preparation for journal submission
Subjects: Methodology (stat.ME)
- [100] arXiv:2510.19785 (replaced) [pdf, ps, other]
-
Title: Green Finance and Carbon Emissions: A Nonlinear and Interaction Analysis Using Bayesian Additive Regression Trees
Comments: 16 pages, 8 figures, pre-print article
Subjects: Applications (stat.AP)
- [101] arXiv:2511.10718 (replaced) [pdf, ps, other]
-
Title: Online Price Competition under Generalized Linear Demands
Subjects: Computer Science and Game Theory (cs.GT); Statistics Theory (math.ST); Methodology (stat.ME)
- [102] arXiv:2512.00242 (replaced) [pdf, ps, other]
-
Title: Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves
Comments: Under Review at ICML 2026
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
- [103] arXiv:2512.21577 (replaced) [pdf, ps, other]
-
Title: A Unified Definition of Hallucination: It's The World Model, Stupid!
Authors: Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng
Comments: HalluWorld benchmark in progress. Repo at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
- [104] arXiv:2601.09693 (replaced) [pdf, ps, other]
-
Title: Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design
Comments: ELLIS ML4Molecules Workshop 2025, ELLIS Unconference, Copenhagen 2025. Revised version with additional timing evaluation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- [105] arXiv:2601.15468 (replaced) [pdf, ps, other]
-
Title: Learning from Synthetic Data: Limitations of ERM
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
- [106] arXiv:2601.16174 (replaced) [pdf, ps, other]
-
Title: Beyond Predictive Uncertainty: Reliable Representation Learning with Structural Constraints
Authors: Yiyao Yang
Comments: 22 pages, 5 figures, 5 propositions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [107] arXiv:2601.16196 (replaced) [pdf, ps, other]
-
Title: Inference on the Significance of Modalities in Multimodal Generalized Linear Models
Comments: This research was supported by the National Institutes of Health under grant R01-AG073259
Subjects: Methodology (stat.ME)
- [108] arXiv:2601.16340 (replaced) [pdf, ps, other]
-
Title: Matrix-Response Generalized Linear Mixed Model with Applications to Longitudinal Brain Images
Comments: This research was supported by the National Institutes of Health under grant R01-AG073259
Subjects: Applications (stat.AP)
- [109] arXiv:2601.17160 (replaced) [pdf, ps, other]
-
Title: Information-Theoretic Causal Bounds under Unmeasured Confounding
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
- [110] arXiv:2601.17217 (replaced) [pdf, ps, other]
-
Title: Transfer learning for scalar-on-function regression via control variates
Comments: 45 pages, 2 figures
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
- [111] arXiv:2601.20197 (replaced) [pdf, ps, other]
-
Title: Bias-Reduced Estimation of Finite Mixtures: An Application to Latent Group Structures in Panel Data
Authors: Raphaël Langevin
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Computation (stat.CO)
- [112] arXiv:2601.21170 (replaced) [pdf, ps, other]
-
Title: The Powers of Precision: Structure-Informed Detection in Complex Systems -- From Customer Churn to Seizure Onset
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- [113] arXiv:2602.00989 (replaced) [pdf, ps, other]
-
Title: Optimal Decision-Making Based on Prediction Sets
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)