Statistics
New submissions
New submissions for Fri, 20 Mar 26
- [1] arXiv:2603.18114 [pdf, ps, other]
-
Title: Transfer Learning for Contextual Joint Assortment-Pricing under Cross-Market Heterogeneity
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
We study transfer learning for contextual joint assortment-pricing under a multinomial logit choice model with bandit feedback. A seller operates across multiple related markets and observes only posted prices and realized purchases. While data from source markets can accelerate learning in a target market, cross-market differences in customer preferences may introduce systematic bias if pooled indiscriminately.
We model heterogeneity through a structured utility shift, where markets share a common contextual utility structure but differ along a sparse set of latent preference coordinates. Building on this, we develop Transfer Joint Assortment-Pricing (TJAP), a bias-aware framework that combines aggregate-then-debias estimation with a UCB-style policy. TJAP constructs two-radius confidence bounds that separately capture statistical uncertainty and transfer-induced bias, uniformly over continuous prices.
We establish matching minimax regret bounds of order $\tilde{O}\!\left(d\sqrt{\frac{T}{1+H}} + s_0\sqrt{T}\right)$, revealing a transparent variance-bias tradeoff: transfer accelerates learning along shared preference directions, while heterogeneous components impose an irreducible adaptation cost. Numerical experiments corroborate the theory, showing that TJAP outperforms both target-only learning and naive pooling while remaining robust to cross-market differences.
- [2] arXiv:2603.18149 [pdf, ps, other]
-
Title: Analysing Extreme Rainfall via a Geometric Framework
Subjects: Methodology (stat.ME)
Motivated by the EVA 2025 Data Challenge, we address the problem of predicting extreme rainfall in the eastern United States using data from a large ensemble of climate model runs. The challenge focuses on three quantities of interest related to the spatial extent and/or temporal duration of extreme rainfall, each requiring extrapolation. To tackle these questions, we adopt the recently developed geometric framework for extreme-value analysis, offering substantial flexibility for capturing complex extremal dependence structures and enabling extrapolation across the entire multivariate tail. In this work, we focus on the spatial geometric framework for analysing the spatial extent and consider a sampling procedure that retains the temporal information in the data, thereby enabling estimation of the duration of extreme rainfall events. We also account for the non-stationary behaviour, arising from topographical and seasonal effects, that commonly characterises extreme weather events in both space and time. Using diagnostic metrics, we demonstrate that the proposed model is appropriate for inferring extreme events on this dataset and apply it to estimate target quantities of interest.
- [3] arXiv:2603.18168 [pdf, ps, other]
-
Title: ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint infinite depth L, hidden width M, and embedding dimension D limit. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). This error rate is empirically tight when measured in embedding space. For a budget of P = Theta(L M D) parameters, this yields a convergence rate O(P^(-1/6)) for the scalings of (L, M, D) that minimize the bound. Our analysis exploits in an essential way the depth-two structure of residual blocks and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this work completes the program initiated in the companion paper [Chi25], where it is proved that for a fixed embedding dimension D, the training dynamics converge to a Mean ODE dynamics at rate O(1/L + sqrt(D)/sqrt(L M)). Here, we study the large-D limit of this Mean ODE model and establish convergence at rate O(1/sqrt(D)), yielding the above bound by a triangle inequality. To handle the rich probabilistic structure of the limit dynamics and obtain one of the first rigorous quantitative convergence results for a DMFT-type limit, we combine the cavity method with propagation of chaos arguments at a functional level on so-called skeleton maps, which express the weight updates as functions of CLT-type sums from the past.
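The O(P^(-1/6)) rate quoted in this abstract can be recovered from the stated error bound by a short balancing argument. The sketch below is not taken from the paper; it assumes P = L M D exactly and ignores constants.

```latex
\begin{align*}
\varepsilon(L,M,D) &= \frac{1}{L} + \sqrt{\frac{D}{LM}} + \frac{1}{\sqrt{D}},
\qquad P = LMD \;\Longrightarrow\; \sqrt{\frac{D}{LM}} = \frac{D}{\sqrt{P}}. \\
\max\!\left(\frac{D}{\sqrt{P}},\, \frac{1}{\sqrt{D}}\right) &\ge P^{-1/6}
\quad\text{for every } D, \text{ with equality at } D = P^{1/3}. \\
\text{Taking } D = P^{1/3},\; LM = P^{2/3},\; P^{1/6} \le L \le P^{2/3}:
&\quad \frac{1}{L} \le P^{-1/6}
\;\Longrightarrow\; \varepsilon(L,M,D) \le 3\,P^{-1/6} = O\!\left(P^{-1/6}\right).
\end{align*}
```

The second line shows that no choice of (L, M, D) beats P^(-1/6) for this bound: the width-depth term grows with D while the embedding term shrinks with it, and the two cross at D = P^(1/3).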
- [4] arXiv:2603.18190 [pdf, ps, other]
-
Title: Starting Off on the Wrong Foot: Pitfalls in Data Preparation
Comments: 42 pages, 37 references
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
When working with real-world insurance data, practitioners often encounter challenges during the data preparation stage that can undermine the statistical validity and reliability of downstream modeling. This study illustrates that conventional data preparation procedures, such as random train-test partitioning, often yield unreliable and unstable results when confronted with highly imbalanced insurance loss data. To mitigate these limitations, we propose a novel data preparation framework leveraging two recent statistical advancements: support points for representative data splitting to ensure distributional consistency across partitions, and the Chatterjee correlation coefficient for initial, non-parametric feature screening to capture feature relevance and dependence structure. We further integrate these theoretical advances into a unified, efficient framework that also incorporates missing-data handling, and embed this framework within our custom InsurAutoML pipeline. The performance of the proposed approach is evaluated using both simulated datasets and datasets often cited in the academic literature. Our findings demonstrate that incorporating statistically rigorous data preparation methods not only significantly enhances model robustness and interpretability but also substantially reduces computational resource requirements across diverse insurance loss modeling tasks. This work provides a methodological upgrade for achieving reliable results in high-stakes insurance applications.
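The feature-screening step mentioned in this abstract relies on the Chatterjee correlation coefficient. The sketch below computes the standard no-ties form of that coefficient and ranks features by it; the function names `chatterjee_xi` and `screen_features` are hypothetical and are not taken from the InsurAutoML pipeline.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n(x, y), assuming no ties in x or y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    order = np.argsort(x, kind="stable")           # arrange the pairs by increasing x
    y_sorted = y[order]
    ranks = np.argsort(np.argsort(y_sorted)) + 1   # rank of each y value, 1..n
    # xi_n = 1 - 3 * sum_i |r_{i+1} - r_i| / (n^2 - 1)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

def screen_features(X, y, top_k):
    """Hypothetical helper: return indices of the top_k columns of X by xi, plus all scores."""
    scores = np.array([chatterjee_xi(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:top_k]
    return keep, scores

# Illustrative data: y depends non-monotonically on column 1 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=500)
keep_demo, scores_demo = screen_features(X_demo, y_demo, top_k=1)
```

Unlike Pearson or Spearman correlation, the coefficient is asymmetric and responds to non-monotone dependence, which is why the quadratic relationship in the illustrative data is still detected.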
- [5] arXiv:2603.18204 [pdf, ps, other]
-
Title: Highly Adaptive Empirical Risk Minimization with Principal Components
Subjects: Statistics Theory (math.ST)
The Highly Adaptive Lasso (HAL) delivers unprecedented guarantees in nonparametric minimum loss estimation under minimal smoothness assumptions, such as dimension-free minimax optimal rates. However, the practical use of HAL has been severely limited by its exponentially growing, computationally prohibitive indicator basis expansion in moderate to high dimensions. Existing screening strategies drastically reduce this dimension but lack any theoretical justification. We introduce the Principal Component Highly Adaptive (PC-HA) family of estimators, which for the first time provides a principled and theoretically valid dimension reduction. We establish formal results on the score equations solved by these PC-HA estimators, which allow plug-in efficiency and pointwise asymptotic normality results to be transferred from HAL to these PC-HA estimators under comparable complexity control.
- [6] arXiv:2603.18225 [pdf, ps, other]
-
Title: A Hybrid Conditional Diffusion-DeepONet Framework for High-Fidelity Stress Prediction in Hyperelastic Materials
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Predicting stress fields in hyperelastic materials with complex microstructures remains challenging for traditional deep learning surrogates, which struggle to capture both sharp stress concentrations and the wide dynamic range of stress magnitudes. Convolutional architectures such as UNet tend to oversmooth high-frequency gradients, while neural operators like DeepONet exhibit spectral bias and underpredict localized extremes. Diffusion models can recover fine-scale structure but often introduce low-frequency amplitude drift, degrading physical scaling. To address these limitations, we propose a hybrid surrogate framework, cDDPM-DeepONet, that decouples stress morphology from magnitude. A conditional denoising diffusion probabilistic model (cDDPM), built on a UNet backbone, generates normalized von Mises stress fields conditioned on geometry and loading. In parallel, a modified DeepONet predicts global scaling parameters (minimum and maximum stress), enabling reconstruction of full-resolution physical stress maps. This separation allows the diffusion model to focus on spatial structure while the operator network corrects global amplitude, mitigating spectral and scaling biases. We evaluate the framework on nonlinear hyperelastic datasets with single and multiple polygonal voids. The proposed model consistently outperforms UNet, DeepONet, and standalone cDDPM baselines by one to two orders of magnitude. Spectral analysis shows strong agreement with finite element solutions across all wavenumbers, preserving both global behavior and localized stress concentrations.
- [7] arXiv:2603.18279 [pdf, ps, other]
-
Title: Covariate-Dependent Functional Principal Component Analysis for SHM
Comments: 10 pages, 3 figures, conference
Subjects: Methodology (stat.ME); Applications (stat.AP)
In Structural Health Monitoring (SHM), sensor measurements and derived features such as eigenfrequencies often exhibit systematic daily patterns and can therefore be naturally represented as functional data. Furthermore, these patterns are typically influenced by environmental factors, particularly temperature, which can substantially affect the observed system response. While most existing methods for removing environmental effects assume that confounding influences affect only the mean response, it has been shown that environmental and operational factors may also alter the covariance structure of the residual process. To address this limitation in a functional data monitoring framework, we incorporate so-called covariate-dependent functional principal component analysis (CD-FPCA), which allows eigenfunctions and eigenvalues of the residual process to vary smoothly with covariates such as temperature. The proposed methodology is illustrated using an extended version of the KW51 railway bridge eigenfrequency dataset. This case study suggests that accounting for covariate effects beyond the functional mean can improve the robustness of the monitoring procedure, in particular by reducing environmentally induced (false) alarms under challenging low-temperature conditions.
- [8] arXiv:2603.18311 [pdf, ps, other]
-
Title: Minimax Optimal Estimation of Mean and Covariance Functions with Spectral Regularization
Subjects: Statistics Theory (math.ST)
Estimation of the mean and covariance functions is a fundamental problem in functional data analysis, particularly for discretely observed functional data. In this work, we study a regularization-based framework for estimating the mean and the covariance functions within a reproducing kernel Hilbert space (RKHS) setting. Our approach utilizes a spectral regularization technique under H\"{o}lder-type source conditions, allowing for a broad class of regularization schemes and accommodating a wide range of smoothness assumptions on the target functions. Unlike previous works in the literature, the proposed work does not require the target functions to belong to the underlying RKHS. Convergence rates for the proposed estimators are derived, and optimality is established by obtaining matching minimax lower bounds.
- [9] arXiv:2603.18324 [pdf, ps, other]
-
Title: Bridging Theory and Practice in Efficient Gaussian Process-Based Statistical Modeling for Large Datasets
Subjects: Computation (stat.CO)
Geostatistics is a branch of statistics concerned with stochastic processes over continuous domains, with Gaussian processes (GPs) providing a flexible and principled modelling framework. However, the high computational cost of simulating or computing likelihoods with GPs limits their scalability to large datasets. This paper introduces the piecewise continuous Gaussian process (PCGP), a new process that retains the rich probabilistic structure of traditional GPs while offering substantial computational efficiency. As will be shown and discussed, existing scalable approaches that define stochastic processes on continuous domains -- such as the nearest-neighbour GP (NNGP) and the radial-neighbour GP (RNGP) -- rely on conditional independence structures that effectively constrain the measurable space on which the processes are defined, which may induce undesirable probabilistic behaviour and compromise their practical applicability, particularly in complex latent GP models. The PCGP mitigates these limitations and provides a theoretically grounded and computationally efficient alternative, as demonstrated through numerical illustrations.
- [10] arXiv:2603.18345 [pdf, ps, other]
-
Title: Synthetic Data, Information, and Prior Knowledge: Why Synthetic Data Augmentation to Boost Sample Doesn't Work for Statistical Inference
Comments: Draft; feedback welcome
Subjects: Methodology (stat.ME)
The use of synthetic data to deidentify data and to improve predictive models is well attested. The augmentation of datasets using synthetically generated data is an alluring proposition: in the best case, it generates realistic data \textit{in silico} at a fraction of the cost of authentic data which may be found \textit{in vivo} or \textit{in vitro}. This poses novel epistemic challenges.
We contend that synthetic data augmentation is best understood as a novel way of accounting for prior knowledge. In this manuscript, we propose a definition of synthetic distributions and analyze how synthetic data augmentation interplays with standard accounts of maximum likelihood and Bayesian estimation. We observe that the marginal Fisher information contributed by synthetic data processes is subject to fundamental bounds, and enumerate obstacles to the use of synthetic data augmentation to aid in inferential tasks.
We then articulate a Bayesian formulation of the way that synthetic data augmentation can be coherently understood, but argue that naive approaches to the specification of the prior are epistemically unjustifiable. This suggests that enhanced scrutiny must be placed on identifying justifiable priors to warrant the use and inclusion of data drawn from specific synthetic distributions.
While our analysis shows the challenges and limitations of using synthetic data augmentation to improve upon traditional statistical model reasoning, it does suggest that augmentation is the principal approach by which analysts using outcome reasoning (i.e. using train/test splits to justify the analysis) can constrain an otherwise high-dimensional model space, providing an alternative to trying to encode the constraints into the potentially complex architecture of the algorithm.
- [11] arXiv:2603.18378 [pdf, ps, other]
-
Title: BiSSLB: Binary Spike-and-Slab Lasso Biclustering
Subjects: Methodology (stat.ME)
Biclustering is a powerful unsupervised learning technique for simultaneously identifying coherent subsets of rows and columns in a data matrix, thus revealing local patterns that may not be apparent in global analyses. However, most biclustering methods are developed for continuous data and are not applicable for binary datasets such as single-nucleotide polymorphism (SNP) or protein-protein interaction (PPI) data. Existing biclustering algorithms for binary data often struggle to recover biclustering patterns under noise, face scalability issues, and/or bias the final results towards biclusters of a particular size or characteristic. We propose a Bayesian method for biclustering binary datasets called Binary Spike-and-Slab Lasso Biclustering (BiSSLB). Our method is robust to noise and allows for overlapping biclusters of various sizes without prior knowledge of the noise level or bicluster characteristics. BiSSLB is based on a logistic matrix factorization model with spike-and-slab priors on the latent spaces. We further incorporate an Indian Buffet Process (IBP) prior to automatically determine the number of biclusters from the data. We develop a novel coordinate ascent algorithm with proximal steps which allows for scalable computation. The performance of our proposed approach is assessed through simulations and two real applications on HapMap SNP and Homo Sapiens PPI data, where BiSSLB is shown to outperform other state-of-the-art binary biclustering methods when the data is very noisy.
- [12] arXiv:2603.18404 [pdf, ps, other]
-
Title: Multi-Domain Causal Empirical Bayes Under Linear Mixing
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Causal representation learning (CRL) aims to learn low-dimensional causal latent variables from high-dimensional observations. While identifiability has been extensively studied for CRL, estimation has been less explored. In this paper, we explore the use of empirical Bayes (EB) to estimate causal representations. In particular, we consider the problem of learning from data from multiple domains, where differences between domains are modeled by interventions in a shared underlying causal model. Multi-domain CRL naturally poses a simultaneous inference problem that EB is designed to tackle. Here, we propose an EB $f$-modeling algorithm that improves the quality of learned causal variables by exploiting invariant structure within and across domains. Specifically, we consider a linear measurement model and interventional priors arising from a shared acyclic SCM. When the graph and intervention targets are known, we develop an EM-style algorithm based on causally structured score matching. We further discuss EB $g$-modeling in the context of existing CRL approaches. In experiments on synthetic data, our proposed method achieves more accurate estimation than other methods for CRL.
- [13] arXiv:2603.18413 [pdf, ps, other]
-
Title: Statistical Testing Framework for Clustering Pipelines by Selective Inference
Comments: 59 pages, 11 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.
- [14] arXiv:2603.18483 [pdf, ps, other]
-
Title: Precise Performance of Linear Denoisers in the Proportional Regime
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
In the present paper we study the performance of linear denoisers for noisy data of the form $\mathbf{x} + \mathbf{z}$, where $\mathbf{x} \in \mathbb{R}^d$ is the desired data with zero mean and unknown covariance $\mathbf{\Sigma}$, and $\mathbf{z} \sim \mathcal{N}(0, \mathbf{\Sigma}_{\mathbf{z}})$ is additive noise. Since the covariance $\mathbf{\Sigma}$ is not known, the standard Wiener filter cannot be employed for denoising. Instead we assume we are given samples $\mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^d$ from the true distribution. A standard approach would then be to estimate $\mathbf{\Sigma}$ from the samples and use it to construct an "empirical" Wiener filter. However, in this paper, motivated by the denoising step in diffusion models, we take a different approach whereby we train a linear denoiser $\mathbf{W}$ from the data itself. In particular, we synthetically construct noisy samples $\hat{\mathbf{x}}_i$ of the data by injecting the samples with Gaussian noise with covariance $\mathbf{\Sigma}_1 \neq \mathbf{\Sigma}_{\mathbf{z}}$ and find the best $\mathbf{W}$ that approximates $\mathbf{W}\hat{\mathbf{x}}_i \approx \mathbf{x}_i$ in a least-squares sense. In the proportional regime $\frac{n}{d} \rightarrow \kappa > 1$ we use the Convex Gaussian Min-Max Theorem (CGMT) to analytically find the closed form expression for the generalization error of the denoiser obtained from this process. Using this expression one can optimize over $\mathbf{\Sigma}_1$ to find the best possible denoiser. Our numerical simulations show that our denoiser outperforms the "empirical" Wiener filter in many scenarios and approaches the optimal Wiener filter as $\kappa\rightarrow\infty$.
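The training step described in this abstract (inject synthetic Gaussian noise, then fit $\mathbf{W}$ by least squares) can be sketched as follows. This is a minimal illustration under assumed values: the dimensions, the diagonal $\mathbf{\Sigma}$, and the choice $\mathbf{\Sigma}_1 = 0.5\,\mathbf{I}$ are hypothetical, and nothing here reproduces the paper's CGMT analysis.

```python
import numpy as np

def train_linear_denoiser(X, X_noisy):
    """Least-squares W minimizing sum_i ||W x_hat_i - x_i||^2.

    X, X_noisy: arrays of shape (n, d), one sample per row.
    """
    # Solve X_noisy @ W.T ~= X in the least-squares sense.
    W_T, *_ = np.linalg.lstsq(X_noisy, X, rcond=None)
    return W_T.T

rng = np.random.default_rng(0)
d, n = 20, 200                                  # n / d = 10, i.e. kappa > 1
Sigma = np.diag(np.linspace(0.1, 2.0, d))       # hypothetical "unknown" data covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

Sigma_1 = 0.5 * np.eye(d)                       # hypothetical training-noise covariance
X_noisy = X + rng.multivariate_normal(np.zeros(d), Sigma_1, size=n)

W = train_linear_denoiser(X, X_noisy)

# Training error of the fitted denoiser versus doing nothing (W = identity).
train_mse = np.mean(np.sum((X_noisy @ W.T - X) ** 2, axis=1))
identity_mse = np.mean(np.sum((X_noisy - X) ** 2, axis=1))
```

On the training set the fitted $\mathbf{W}$ can never do worse than the identity map, since the identity is one of the candidates the least-squares problem optimizes over. The quantity the paper characterizes is the generalization error on fresh data with noise covariance $\mathbf{\Sigma}_{\mathbf{z}}$, which this sketch does not compute.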
- [15] arXiv:2603.18490 [pdf, ps, other]
-
Title: The minimax optimal convergence rate of posterior density in the weighted orthogonal polynomials
Comments: 27 pages, 2 figures, 1 supplementary material (11 pages)
Subjects: Statistics Theory (math.ST)
We investigate Bayesian nonparametric density estimation via orthogonal polynomial expansions in weighted Sobolev spaces. A core challenge is establishing minimax optimal posterior convergence rates, especially for densities on unbounded domains without a strictly positive lower bound. For densities bounded away from zero, we give sufficient conditions under which the framework of \cite{shen2001} applies directly. For densities lacking a positive lower bound, the equivalence between Hellinger and weighted $L_2$-norm distance fails, invalidating the original theory. We propose a novel shifting method that lifts the true density $g_0$ to a sequence of proxy densities $g_{0,n}$. We prove a modified convergence theorem applicable to these shifted densities, preserving the optimal rate. We also construct a Gaussian sieve prior that achieves the minimax rate $\varepsilon_n=n^{-p/(2p+1)}$ for any integer $p\geq1$. Numerical results confirm that our estimator approximates the true density well and validate the theoretical convergence rate.
- [16] arXiv:2603.18506 [pdf, ps, other]
-
Title: Approximation by mixtures of multivariate Erlang distributions
Authors: Hien Duy Nguyen
Subjects: Statistics Theory (math.ST)
We prove that finite multivariate Erlang mixture densities with a common rate parameter are dense in the class of probability densities on $\mathbb{R}_{+}^{d}$ that belong to $L^{p}$, for every dimension $d\in\mathbb{N}$ and every $1\le p<\infty$. The argument is constructive: the one-dimensional Sz\'asz--Mirakjan--Kantorovich operator yields Erlang mixture approximations, and its tensor product yields multivariate approximants with a common scale. We then obtain several quantitative consequences. These include compact-set uniform approximation bounds and, under local H\"older conditions of order $\alpha\in(0,1]$, rates of order $n^{-\alpha/2}$ as the common scale $1/n$ tends to zero, whole-domain convergence in weighted sup norms, weighted and unweighted $L^{p}$ rates, and explicit rates for finite mixtures indexed by the number of mixture components. In particular, if the approximating density is required to have at most $K$ mixture components, then on fixed compact cubes we obtain an algebraic rate of order $K^{-\alpha/(2d)}$; in global weighted sup norms we obtain the explicit algebraic component-count rate $K^{-\alpha/[2d(2d+\alpha)]}$; and for $1<p<\infty$ we obtain corresponding weighted $L^{p}$ component-count rates. The results strengthen the weak-approximation theory for multivariate Erlang mixture distributions and yield immediate corollaries for broader classes such as product-gamma mixtures.
Keywords: multivariate Erlang mixtures; Erlang distributions; Sz\'asz--Mirakjan--Kantorovich operator; density approximation; weighted $L^{p}$ approximation; approximation rates.
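The constructive step in this abstract can be illustrated in one dimension. The Szász-Mirakjan-Kantorovich operator applied to a density $f$ gives $\sum_k w_k\, n\, e^{-nx}(nx)^k/k!$ with $w_k = \int_{k/n}^{(k+1)/n} f(t)\,dt$, and each term $n\, e^{-nx}(nx)^k/k!$ is the Erlang density with shape $k+1$ and rate $n$. The sketch below applies this to a unit exponential target; the function names and the truncation at 1000 components are assumptions made for illustration, not the paper's construction of finite mixtures.

```python
import math

def erlang_pdf(x, shape, rate):
    """Erlang density with integer shape >= 1 and rate > 0, evaluated at x > 0."""
    log_pdf = (shape * math.log(rate) + (shape - 1) * math.log(x)
               - rate * x - math.lgamma(shape))
    return math.exp(log_pdf)

def smk_weights(cdf, n, num_components):
    """Mixture weights w_k = F((k+1)/n) - F(k/n) for k = 0..num_components-1."""
    return [cdf((k + 1) / n) - cdf(k / n) for k in range(num_components)]

def mixture_pdf(x, weights, n):
    """Erlang mixture with common rate n; component k has shape k + 1."""
    return sum(w * erlang_pdf(x, k + 1, n) for k, w in enumerate(weights))

def exp_cdf(t):
    """CDF of the unit exponential target density f(t) = exp(-t)."""
    return 1.0 - math.exp(-t)

n = 50                                    # common rate; the scale is 1/n
weights = smk_weights(exp_cdf, n, num_components=1000)
approx_at_1 = mixture_pdf(1.0, weights, n)
target_at_1 = math.exp(-1.0)
```

All components share the rate $n$, matching the "common rate parameter" in the abstract, and the weights are cell probabilities of the target, so the untruncated mixture is itself a probability density.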
- [17] arXiv:2603.18514 [pdf, ps, other]
-
Title: On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization
Comments: 21 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $\Theta(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $\Theta(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity is present. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.
- [18] arXiv:2603.18590 [pdf, ps, other]
-
Title: Sometimes nonparametrics beat parametrics, even when the model is right
Comments: 18 pages, 2 figures; Statistical Research Report, Department of Mathematics, University of Oslo, October 1996, but now arXiv'd March 2026
Subjects: Statistics Theory (math.ST)
A basic issue in both teaching of and practice of statistics is the interplay between modelling assumptions and inference performance. The general message conveyed is that stronger assumptions lead to better statistical performance of the relevant estimators, tests and confidence intervals, provided that these assumptions hold. On the other hand, fewer assumptions often lead to safer and more robust methods that are good also outside narrow conditions, but not quite as good as specialist methods that exploit such narrower conditions, if these are fulfilled.
This interplay is nicely illustrated in the context of density estimation, where parametric and nonparametric methods can be contrasted. The parametric ones have mean squared errors of size $O(n^{-1})$ in terms of sample size $n$ if the parametric model is right, but are not even consistent outside the model. The nonparametric methods are everywhere consistent and have mean squared errors of size $O(n^{-4/5})$ for broad classes of estimands.
The point we are making here is that this picture is not universally true! We show that a simple kernel density estimator can perform better than a directly estimated parametric density on the latter's home turf, for small sample sizes, in the sense of mean integrated squared error. Our main example is that of estimating an unknown normal density. In the process of developing and discussing this somewhat counter-intuitive and half-paradoxical example we touch on several tangential issues of interest, pertaining to exact small-sample analysis of density estimators.
- [19] arXiv:2603.18640 [pdf, ps, other]
-
Title: A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
The No-U-Turn Sampler (NUTS) is the computational workhorse of modern Bayesian software libraries, yet its qualitative and quantitative convergence guarantees were established only recently. A significant gap remains in the theoretical comparison of its two main variants: NUTS-mul and NUTS-BPS, which use multinomial sampling and biased progressive sampling, respectively, for index selection. In this paper, we address this gap in three contributions. First, we derive the first necessary conditions for geometric ergodicity for both variants. Second, we establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. Third, we obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. However, they differ quantitatively in their convergence rates. More precisely, when initialized in the typical set of the canonical Gaussian measure, the mixing times of both NUTS-mul and NUTS-BPS scale as $O(d^{1/4})$ up to logarithmic factors, where $d$ denotes the dimension. Nevertheless, the associated constants are strictly smaller for NUTS-BPS.
- [20] arXiv:2603.18781 [pdf, ps, other]
-
Title: SRRM: Improving Recursive Transport Surrogates in the Small-Discrepancy Regime
Comments: 29 pages, 20 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Recursive partitioning methods provide computationally efficient surrogates for the Wasserstein distance, yet their statistical behavior and their resolution in the small-discrepancy regime remain insufficiently understood. We study Recursive Rank Matching (RRM) as a representative instance of this class under a population-anchored reference. In this setting, we establish consistency and an explicit convergence rate for the anchored empirical RRM under the quadratic cost. We then identify a dominant mismatch mechanism responsible for the loss of resolution in the small-discrepancy regime. Based on this analysis, we introduce Selective Recursive Rank Matching (SRRM), which suppresses the resulting dominant mismatches and yields a higher-fidelity practical surrogate for the Wasserstein distance at moderate additional computational cost.
- [21] arXiv:2603.18833 [pdf, ps, other]
-
Title: Estimation of Functional Principal Components from Sparse Functional Data
Authors: Uche Mbaka (1), Jiguo Cao (2), Michelle Carey (1) ((1) University College Dublin, (2) Simon Fraser University)
Subjects: Methodology (stat.ME); Computation (stat.CO)
Sparse functional data arise when measurements are observed infrequently and at irregular time points for each subject, often in the presence of measurement error. These characteristics introduce additional challenges for functional principal component analysis. In this paper, we propose a new approach for extracting functional principal components from such data by combining basis expansion with maximum likelihood estimation. Orthogonality of the estimated eigenfunctions is preserved throughout the optimization using modified Gram-Schmidt orthonormalization. An information criterion is proposed to select both the optimal number of basis functions and the rank of the covariance structure. Principal component scores are subsequently estimated via conditional expectation, enabling accurate reconstruction of the underlying functional trajectories across the full domain despite sparse observations. Simulation studies demonstrate the effectiveness of the proposed method and show that it performs favorably compared with existing approaches. Its practical utility is illustrated through applications to CD4 cell count data from the Multicenter AIDS Cohort Study and somatic cell count data from Irish research dairy cattle. Supplementary materials, including technical details, additional simulation results, and the R package mGSFPCA, are available online.
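The orthogonality step named in this abstract uses modified Gram-Schmidt orthonormalization. The sketch below shows the generic algorithm on the columns of a matrix under the Euclidean inner product. It is an assumption that this matches how the coefficients are handled in practice: the mGSFPCA package may use a different inner product (for example one induced by the basis functions), and the function name is hypothetical.

```python
import numpy as np

def modified_gram_schmidt(A, tol=1e-12):
    """Orthonormalize the columns of A with modified Gram-Schmidt.

    Returns Q with the same shape as A and Q.T @ Q close to the identity.
    Raises ValueError if a column is (numerically) dependent on earlier ones.
    """
    Q = np.array(A, dtype=float)          # work on a copy
    n_cols = Q.shape[1]
    for j in range(n_cols):
        norm = np.linalg.norm(Q[:, j])
        if norm < tol:
            raise ValueError(f"column {j} is numerically linearly dependent")
        Q[:, j] /= norm
        # Remove the component along q_j from every remaining column right away;
        # this immediate update is what distinguishes the modified variant
        # from classical Gram-Schmidt.
        for k in range(j + 1, n_cols):
            Q[:, k] -= (Q[:, j] @ Q[:, k]) * Q[:, j]
    return Q

rng = np.random.default_rng(0)
A_demo = rng.normal(size=(50, 4))
Q_demo = modified_gram_schmidt(A_demo)
```

Subtracting each projection from the already-updated columns keeps rounding errors from compounding, which matters when orthonormalization is repeated at every iteration of an optimizer.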
- [22] arXiv:2603.18845 [pdf, ps, other]
-
Title: Preconditioning Hamiltonian Monte Carlo by minimizing Fisher Divergence
Subjects: Computation (stat.CO)
Although Hamiltonian Monte Carlo (HMC) scales as O(d^(1/4)) in dimension, there is a large constant factor determined by the curvature of the target density. This constant factor can be reduced in most cases through preconditioning, the state of the art for which uses diagonal or dense penalized maximum likelihood estimation of (co)variance based on a sample of warmup draws. These estimates converge slowly in the diagonal case and scale poorly when expanded to the dense case. We propose a more effective estimator based on minimizing the sample Fisher divergence from a linearly transformed density to a standard normal distribution. We present this estimator in three forms, (a) diagonal, (b) dense, and (c) low-rank plus diagonal. Using a collection of 114 models from posteriordb, we demonstrate that the diagonal minimizer of Fisher divergence outperforms the industry-standard variance-based diagonal estimators used by Stan and PyMC by a median factor of 1.3. The low-rank plus diagonal minimizer of the Fisher divergence outperforms Stan and PyMC's diagonal estimators by a median factor of 4.
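For the diagonal case, one way to work out a Fisher-divergence minimizer by hand is as follows; this is a sketch under stated assumptions, not necessarily the estimator the authors implement. Assume the transform $y_i = (x_i - \mu_i)/\sigma_i$, so the transformed score is $\sigma_i g_i$ with $g = \nabla \log p(x)$, and the standard normal score is $-y_i$. Minimizing the sample average of $(\sigma_i g_i + (x_i - \mu_i)/\sigma_i)^2$ over $\mu_i$ and $\sigma_i$ gives $\sigma_i^2 = \sqrt{\operatorname{Var}(x_i)/\operatorname{Var}(g_i)}$. The function name `fisher_diag_scales` is hypothetical.

```python
import math


def _variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def fisher_diag_scales(draws, grads):
    """Return sigma_i^2 = sqrt(Var(x_i) / Var(g_i)) for each coordinate.

    draws: list of warmup draws, each a list of floats.
    grads: gradients of the log density at those draws, same shape.
    Assumption: the diagonal-plus-shift transform described in the lead-in.
    """
    dim = len(draws[0])
    scales = []
    for i in range(dim):
        var_x = _variance([d[i] for d in draws])
        var_g = _variance([g[i] for g in grads])
        scales.append(math.sqrt(var_x / var_g))
    return scales


# Toy check on a zero-mean Gaussian target with variances (4, 0.25),
# whose log-density gradient is -x / variance in each coordinate.
target_var = [4.0, 0.25]
draws = [[-2.0, -1.0], [-1.0, 0.5], [0.0, 0.0], [1.0, -0.5], [2.0, 1.0]]
grads = [[-x / v for x, v in zip(d, target_var)] for d in draws]
scales = fisher_diag_scales(draws, grads)
```

For a Gaussian target the gradients are an exact linear function of the draws, so this scale recovers the target variances from very few draws, which is consistent with the abstract's remark that variance-only estimates converge slowly.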
- [23] arXiv:2603.18928 [pdf, ps, other]
-
Title: A Bayesian Reinterpretation of Cornfield-Type Sensitivity Analysis: From Thresholds to Probabilities
Authors: Tommaso Costa
Subjects: Other Statistics (stat.OT)
Sensitivity analysis for unmeasured confounding in observational studies is commonly based on threshold quantities, such as the Cornfield condition or the E-value, which quantify how strong a confounder must be to explain away an observed association. However, these approaches do not address a fundamental inferential question: how plausible is it that such a confounder exists? In this work, we propose a Bayesian reformulation of Cornfield-type sensitivity analysis in which the strength of unmeasured confounding is treated as a random variable. Within this framework, the E-value is reinterpreted as a threshold, and the central inferential quantity becomes the posterior probability that confounding exceeds this threshold. This transforms sensitivity analysis from a descriptive diagnostic into a probabilistic assessment of robustness. We develop a simple generative model linking observed effect estimates to true causal effects and confounding bias, and we specify prior distributions reflecting plausible confounding mechanisms. The resulting framework yields posterior measures of evidential vulnerability that are directly interpretable and applicable to summary-level data. Illustrations based on empirical case studies show that the proposed approach preserves the interpretability of the E-value while providing a more nuanced and decision-relevant characterization of robustness. More broadly, the framework aligns sensitivity analysis with Bayesian principles of inference under uncertainty, offering a coherent alternative to purely threshold-based reasoning.
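The shift from a threshold to a probability can be illustrated with a small sketch. The E-value formula for a risk ratio, $\mathrm{RR} + \sqrt{\mathrm{RR}(\mathrm{RR}-1)}$, is standard; everything else here is an assumption made for illustration: the lognormal distribution for confounding strength, its parameters, and the function name `prob_confounding_exceeds` are hypothetical and do not reproduce the paper's generative model.

```python
import math
import random


def e_value(rr):
    """E-value for an observed risk ratio (ratios below 1 are inverted)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = rr if rr >= 1.0 else 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


def prob_confounding_exceeds(rr, mu, sigma, n_draws=20_000, seed=0):
    """Monte Carlo estimate of P(confounding strength > E-value).

    Assumption: confounding strength on the risk-ratio scale is lognormal
    with hypothetical location mu and scale sigma on the log scale.
    """
    rng = random.Random(seed)
    threshold = e_value(rr)
    hits = sum(
        1 for _ in range(n_draws)
        if math.exp(rng.gauss(mu, sigma)) > threshold
    )
    return hits / n_draws


threshold = e_value(2.0)
prob = prob_confounding_exceeds(2.0, mu=math.log(1.5), sigma=0.5)
```

The threshold stays a single number, while the reported quantity becomes a probability that depends on the assumed distribution of confounding strength.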
- [24] arXiv:2603.18938 [pdf, ps, other]
-
Title: Kernel Single-Index Bandits: Estimation, Inference, and Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.
- [25] arXiv:2603.18941 [pdf, ps, other]
-
Title: Unified Taxonomy for Multivariate Time Series Anomaly Detection using Deep Learning
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The topic of Multivariate Time Series Anomaly Detection (MTSAD) has grown rapidly over the past years, with a steady rise in publications and Deep Learning (DL) models becoming the dominant paradigm. To address the lack of systematization in the field, this study introduces a novel and unified taxonomy with eleven dimensions over three parts (Input, Output and Model) for the categorization of DL-based MTSAD methods. The dimensions were established in a two-fold approach. First, they were derived from a comprehensive analysis of methodological studies. Second, insights from review papers were incorporated. Furthermore, the proposed taxonomy was validated using an additional set of recent publications, providing a clear overview of methodological trends in MTSAD. Results reveal a convergence toward Transformer-based and reconstruction and prediction models, setting the foundation for emerging adaptive and generative trends. Building on and complementing existing surveys, this unified taxonomy is designed to accommodate future developments, allowing for new categories or dimensions to be added as the field progresses. This work thus consolidates fragmented knowledge in the field and provides a reference point for future research in MTSAD.
- [26] arXiv:2603.18985 [pdf, ps, other]
-
Title: Revisiting OmniAnomaly for Anomaly Detection: performance metrics and comparison with PCA-based models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep learning models have become the dominant approach for multivariate time series anomaly detection (MTSAD), often reporting substantial performance improvements over classical statistical methods. However, these gains are frequently evaluated under heterogeneous thresholding strategies and evaluation protocols, making fair comparisons difficult. This work revisits OmniAnomaly, a widely used stochastic recurrent model for MTSAD, and systematically compares it with a simple linear baseline based on Principal Component Analysis (PCA) on the Server Machine Dataset (SMD). Both methods are evaluated under identical thresholding and evaluation procedures, with experiments repeated across 100 runs for each of the 28 machines in the dataset. Performance is evaluated using Precision, Recall and F1-score at point-level, with and without point-adjustment, and under different aggregation strategies across machines and runs, with the corresponding standard deviations also reported. The results show large variability across machines and show that PCA can achieve performance comparable to OmniAnomaly, and even outperform it when point-adjustment is not applied. These findings question the added value of more complex architectures under current benchmarking practices and highlight the critical role of evaluation methodology in MTSAD research.
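The gap between scores with and without point-adjustment is easier to see in code. The sketch below follows the commonly used point-adjustment convention (if any point inside a ground-truth anomaly segment is flagged, the whole segment counts as detected); it is an illustration of that convention rather than the paper's evaluation script, and the function names are hypothetical.

```python
def point_adjust(labels, preds):
    """Mark a whole ground-truth anomaly segment as detected if any point
    inside it was flagged. labels and preds are lists of 0/1."""
    adjusted = list(preds)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1  # j ends one past the last point of the segment
            if any(preds[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


def precision_recall_f1(labels, preds):
    """Point-level precision, recall, and F1 for binary labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


labels = [0, 1, 1, 1, 0, 0, 1, 1]
preds = [0, 0, 1, 0, 0, 0, 0, 0]
raw_scores = precision_recall_f1(labels, preds)
adjusted_scores = precision_recall_f1(labels, point_adjust(labels, preds))
```

In this toy series a single flagged point raises recall from 0.2 to 0.6 after adjustment, which shows how strongly the protocol can change reported F1.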
- [27] arXiv:2603.18990 [pdf, ps, other]
-
Title: Distributed lag non-linear models with spatial effect modification using Laplacian P-splines
Subjects: Methodology (stat.ME)
Distributed lag non-linear models (DLNMs) are a popular approach to flexibly model the effect of time-delayed exposures. Classical DLNMs specify a common exposure-lag-response relationship across geographical areas. However, this relationship might be altered by an effect modifier that differs between spatial units. Although some methods have been proposed to account for effect modification, their applicability is context-dependent. For example, a meta-analysis can account for heterogeneity between groups, but this technique requires sufficiently large study groups. This limitation is particularly relevant when working with count data, where small numbers of events are often encountered. In this paper, we review existing methods that allow for spatial effect modification for count-based outcomes and propose a Bayesian DLNM alternative method that accounts for the modifier through flexible interaction effects. Through the use of Laplacian P-splines, we provide a computationally fast estimation procedure by avoiding the use of classical Markov Chain Monte Carlo (MCMC) approaches. The performance of the different methods is evaluated through simulation studies. Moreover, the practical applicability of our proposed method is showcased through a data application, containing daily temperature and mortality count data in 87 Italian cities.
- [28] arXiv:2603.19041 [pdf, ps, other]
-
Title: Fast and Interpretable Autoregressive Estimation with Neural Network Backpropagation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Autoregressive (AR) models remain widely used in time series analysis due to their interpretability, but conventional parameter estimation methods can be computationally expensive and prone to convergence issues. This paper proposes a Neural Network (NN) formulation of AR estimation by embedding the autoregressive structure directly into a feedforward NN, enabling coefficient estimation through backpropagation while preserving interpretability. Simulation experiments on 125,000 synthetic AR(p) time series with short-term dependence (1 <= p <= 5) show that the proposed NN-based method consistently recovers model coefficients for all series, while Conditional Maximum Likelihood (CML) fails to converge in approximately 55% of cases. When both methods converge, estimation accuracy is comparable, with negligible differences in relative error, $R^2$, and perplexity/likelihood. However, when CML fails, the NN-based approach still provides reliable estimates. In all cases, the NN estimator achieves substantial computational gains, reaching a median speedup of 12.6x and up to 34.2x for higher model orders. Overall, results demonstrate that gradient-descent NN optimization can provide a fast and efficient alternative for interpretable AR parameter estimation.
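The idea of reading AR coefficients off network weights can be sketched with the simplest possible case: a single linear layer without bias, trained by full-batch gradient descent on mean squared error, whose weights are the AR coefficients. This is an illustrative reduction, not the paper's architecture or training setup; the function names, learning rate, and step count are assumptions.

```python
import random


def simulate_ar(coeffs, n, seed=0):
    """Simulate an AR(p) series with standard normal noise.
    coeffs[k] multiplies the value k + 1 steps back."""
    rng = random.Random(seed)
    p = len(coeffs)
    x = [0.0] * p  # zero initial conditions
    for _ in range(n):
        prediction = sum(c * x[-k - 1] for k, c in enumerate(coeffs))
        x.append(prediction + rng.gauss(0.0, 1.0))
    return x[p:]


def fit_ar_gd(series, p, lr=0.05, steps=300):
    """Estimate AR(p) coefficients by gradient descent on the MSE of a
    bias-free linear layer (hypothetical hyperparameters)."""
    w = [0.0] * p
    # Each row pairs the lags [x_{t-1}, ..., x_{t-p}] with the target x_t.
    rows = [(series[t - p:t][::-1], series[t]) for t in range(p, len(series))]
    for _ in range(steps):
        grad = [0.0] * p
        for lags, target in rows:
            err = sum(wk * lk for wk, lk in zip(w, lags)) - target
            for k in range(p):
                grad[k] += 2.0 * err * lags[k] / len(rows)
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


series = simulate_ar([0.5, -0.3], 1000, seed=1)
weights = fit_ar_gd(series, p=2)
```

Because the layer is linear and has no hidden units, each learned weight maps directly to one lag coefficient, which is the sense in which the fit stays interpretable.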
- [29] arXiv:2603.19051 [pdf, ps, other]
-
Title: Optimal Sample Size Calculation in Cost-Effectiveness Longitudinal Cluster Randomized Trials
Authors: Hao Wang, Jingxia Liu, Drew B. Cameron, Jiaqi Tong, Donna Spiegelman, Daniella Meeker, Fan Li
Subjects: Methodology (stat.ME)
Longitudinal cluster randomized trials (L-CRTs) are increasingly used to evaluate the cost-effectiveness of healthcare interventions across multiple assessment periods, yet design methods for powering these trials remain underdeveloped. Existing methods for cost-effectiveness analyses in cluster settings are limited to simple parallel-arm cluster randomized trials with a single follow-up assessment period. These methods cannot accommodate the complex correlation structures in L-CRTs conducted over multiple periods, which require differentiation between within-period and between-period correlations for both clinical and cost outcomes, as well as between-outcome correlations. Moreover, while substantial methodological advances have been made for the design of L-CRTs with univariate outcomes, none specifically address cost-effectiveness objectives where clinical and cost outcomes must be jointly modeled. We provide a design-stage framework for powering cost-effectiveness L-CRTs across three design variants: parallel-arm, crossover, and stepped wedge designs. We derive closed-form variance expressions for the generalized least squares estimator of the average incremental net monetary benefit under a bivariate linear mixed model. We propose a standardized ceiling ratio that adjusts willingness-to-pay for relative outcome variability to inform optimal design. We then develop local optimal designs that maximize statistical power under known correlation parameters and MaxiMin designs that ensure robust performance across parameter uncertainty for all three design variants. Through a real stepped wedge trial data example, we demonstrate the sample size calculation for testing intervention cost-effectiveness under local optimal and MaxiMin designs.
- [30] arXiv:2603.19055 [pdf, ps, other]
-
Title: Probabilistic multivariate statistical process control via kernel parameter uncertainty propagation
Authors: Zina-Sabrina Duma, Victoria Jorry, Ayesha Safraz, Maria Paola di Crosta, Tuomas Sihvonen, Lassi Roininen, Satu-Pia Reinikainen
Subjects: Applications (stat.AP)
Kernel-based multivariate statistical process control (K-MSPC) extends classical monitoring to nonlinear industrial processes. Its performance depends critically on kernel parameters such as lengthscales and variance terms. In current practice these parameters are typically selected by heuristics or deterministic optimisation, and then treated as fixed, despite being inferred from finite and noisy data. This can lead to overconfident control limits and unstable alarm behaviour when the kernel choice is uncertain. This work proposes a probabilistic K-MSPC framework that quantifies and propagates kernel parameter uncertainty to the monitoring statistics. The approach follows a two-stage workflow: (i) deterministic kernel calibration using supervised or unsupervised models, and (ii) Bayesian inference of kernel parameters via Markov chain Monte Carlo. Posterior samples are propagated through kernel Principal Component Analysis to produce probabilistic $T^2$ and squared prediction error control charts, together with uncertainty-aware contribution plots. The framework is evaluated on the Tennessee Eastman Process benchmark. Results show that posterior-mean monitoring often improves fault detection compared to deterministic prior-mean charts for the squared exponential kernel, while credible bands remain narrow in-control and widen under faults, reflecting amplified epistemic uncertainty in abnormal regimes. The automatic relevance determination kernel reduces posterior uncertainty and yields performance close to the deterministic baseline, whereas unsupervised calibration produces wider posterior bands but still robust fault detection.
- [31] arXiv:2603.19058 [pdf, ps, other]
-
Title: Adaptive Nonlinear Data Assimilation through P-Spline Triangular Measure Transport
Comments: 24 pages, 10 figures
Subjects: Computation (stat.CO); Atmospheric and Oceanic Physics (physics.ao-ph); Methodology (stat.ME); Machine Learning (stat.ML)
Non-Gaussian statistics are a challenge for data assimilation. Linear methods oversimplify the problem, yet fully nonlinear methods are often too expensive to use in practice. The best solution usually lies between these extremes. Triangular measure transport offers a flexible framework for nonlinear data assimilation. Its success, however, depends on how the map is parametrized. Too much flexibility leads to overfitting; too little misses important structure. To address this balance, we develop an adaptation algorithm that selects a parsimonious parametrization automatically. Our method uses P-spline basis functions and an information criterion as a continuous measure of model complexity. This formulation enables gradient descent and allows efficient, fine-scale adaptation in high-dimensional settings. The resulting algorithm requires no hyperparameter tuning. It adjusts the transport map to the appropriate level of complexity based on the system statistics and ensemble size. We demonstrate its performance in nonlinear, non-Gaussian problems, including a high-dimensional distributed groundwater model.
- [32] arXiv:2603.19073 [pdf, ps, other]
-
Title: Finite-sample bounds for multi-output system identification
Comments: Submitted for review to IEEE Transactions on Automatic Control
Subjects: Statistics Theory (math.ST); Dynamical Systems (math.DS)
This paper presents uniform-in-time finite-sample bounds for regularized linear regression with vector-valued outputs and conditionally zero-mean subgaussian noise. By revisiting classical self-normalized martingale arguments, we obtain bounds that apply directly to multi-output regression, unlike most of the prior work. Compared to the state of the art, the new results are more general and yield tighter bounds, even for scalar-valued outputs. The mild assumptions we use allow for unknown dependencies between regressors and past noise terms, typically induced by system dynamics or feedback mechanisms. Therefore, these novel finite-sample bounds can be applied to many affine-in-parameter system identification problems, including the identification of a linear time-invariant system from full-state measurements. These new results may lead to significant improvements in stochastic learning-based controllers for safety-critical applications.
- [33] arXiv:2603.19143 [pdf, ps, other]
-
Title: The Uncertain Policy Price of Scaling Direct Air Capture
Authors: Leonardo Chiani, Pietro Andreoni, Laurent Drouet, Tobias Schmidt, Katrin Sievert, Bjerne Steffen, Massimo Tavoni
Subjects: Applications (stat.AP)
Direct air carbon capture and storage (DACCS) is a promising CO2 removal technology, but its deployment at scale remains speculative. Yet, its technological, economic, and policy-related uncertainties have often been overlooked in mitigation pathways. This paper conducts the first uncertainty quantification and global sensitivity analysis of DACCS on technological, market, financial and public support drivers, using a detailed-process Integrated Assessment Model and newly developed sensitivity algorithms. We find that DACCS deployment exhibits a fat-tailed distribution: most scenarios show modest technology uptake, but there is a small but non-zero probability (4-6%) of achieving gigaton-scale removals by mid-century. Scaling DACCS to gigaton levels requires subsidies that always exceed 200-330 USD/tCO2 and are sustained for decades, resulting in a public support programme of 900-3000 billion USD. Such an effort pays back by mid-century, but only if accompanied by strong emission reduction policies. These findings highlight the critical role of climate policies in enabling a robust and economically sustainable CO2 removal strategy.
- [34] arXiv:2603.19160 [pdf, ps, other]
-
Title: PPI is the Difference Estimator: Recognizing the Survey Sampling Roots of Prediction-Powered Inference
Authors: Reagan Mozer
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Prediction-powered inference (PPI) is a rapidly growing framework for combining machine learning predictions with a small set of gold-standard labels to conduct valid statistical inference. In this article, I argue that the core estimators underlying PPI are equivalent to well-established estimators from the survey sampling literature dating back to the 1970s. Specifically, the PPI estimator for a population mean is algebraically equivalent to the difference estimator of Cassel et al. (1976), and PPI plus corresponds to the generalized regression (GREG) estimator of Sarndal et al. (2003). Recognizing this equivalence, I consider what part of PPI is inherited from a long-standing literature in statistics, what part is genuinely new, and where inferential claims require care. After introducing the two frameworks and establishing their equivalence, I break down where PPI diverges from model-assisted estimation, including differences in the mode of inference, the role of the unlabeled data pool, and the consequences of differential prediction error for subgroup estimands such as the average treatment effect. I then identify what each framework offers the other: PPI researchers can draw on the survey sampling literature's well-developed theory of calibration, optimal allocation, and design-based diagnostics, while survey sampling researchers can benefit from PPI's extensions to non-standard estimands and its accessible software ecosystem. The article closes with a call for integration between these two communities, motivated by the growing use of large language models as measurement instruments in applied research.
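The mean estimator discussed above is short enough to write out. This sketch shows the basic form for a population mean: the average prediction on the unlabeled pool plus the average labeled residual, which is the same "prediction total plus correction from the sample" shape as the difference estimator. The function and argument names are hypothetical, and equal weighting of labeled units is an assumption (no survey design weights).

```python
def ppi_mean(labeled_y, labeled_pred, unlabeled_pred):
    """Prediction-powered estimate of a population mean.

    labeled_y:      gold-standard labels on the small labeled sample
    labeled_pred:   model predictions for those same labeled units
    unlabeled_pred: model predictions on the large unlabeled pool
    """
    if len(labeled_y) != len(labeled_pred):
        raise ValueError("labels and predictions must align")
    mean_prediction = sum(unlabeled_pred) / len(unlabeled_pred)
    # The correction term estimates the average prediction error.
    rectifier = sum(y - f for y, f in zip(labeled_y, labeled_pred)) / len(labeled_y)
    return mean_prediction + rectifier


# Toy data: the model over-predicts every labeled unit by 0.5.
estimate = ppi_mean(
    labeled_y=[1.0, 2.0, 3.0],
    labeled_pred=[1.5, 2.5, 3.5],
    unlabeled_pred=[2.5, 3.5, 4.5, 5.5],
)
```

Here the unlabeled predictions average 4.0 and the labeled residuals average -0.5, so the estimate is 3.5: the labeled sample is used only to remove the systematic prediction error.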
- [35] arXiv:2603.19198 [pdf, ps, other]
-
Title: The Exponentially Weighted Signature
Comments: 43 pages, 1 figure
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The signature is a canonical representation of a multidimensional path over an interval. However, it treats all historical information uniformly, offering no intrinsic mechanism for contextualising the relevance of the past. To address this, we introduce the Exponentially Weighted Signature (EWS), generalising the Exponentially Fading Memory (EFM) signature from diagonal to general bounded linear operators. These operators enable cross-channel coupling at the level of temporal weighting together with richer memory dynamics including oscillatory, growth, and regime-dependent behaviour, while preserving the algebraic strengths of the classical signature. We show that the EWS is the unique solution to a linear controlled differential equation on the tensor algebra, and that it generalises both state-space models and the Laplace and Fourier transforms of the path. The group-like structure of the EWS enables efficient computation and makes the framework amenable to gradient-based learning, with the full semigroup action parametrised by and learned through its generator. We use this framework to empirically demonstrate the expressivity gap between the EWS and both the signature and EFM on two SDE-based regression tasks.
- [36] arXiv:2603.19211 [pdf, ps, other]
-
Title: Synthetic Control Misconceptions: Recommendations for Practice
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
To estimate the causal effect of an intervention, researchers need to identify a control group that represents what might have happened to the treatment group in the absence of that intervention. This is challenging without a randomized experiment and further complicated when few units (possibly only one) are treated. Nevertheless, when data are available on units over time, synthetic control (SC) methods provide an opportunity to construct a valid comparison by differentially weighting control units that did not receive the treatment so that their resulting pre-treatment trajectory is similar to that of the treated unit. The hope is that this weighted "pseudo-counterfactual" can serve as a valid counterfactual in the post-treatment time period. Since its origin twenty years ago, SC has been used over 5,000 times in the literature (Web of Science, December 2025), leading to a proliferation of descriptions of the method and guidance on proper usage that is not always accurate and does not always align with what the original developers appear to have intended. As such, a number of accepted pieces of wisdom have arisen: (1) SC is robust to various implementations; (2) covariates are unnecessary; and (3) pre-treatment prediction error should guide model selection. We describe each in detail and conduct simulations that suggest, both for standard and alternative implementations of SC, that these purported truths are not supported by empirical evidence and thus actually represent misconceptions about best practice. Instead of relying on these misconceptions, we offer practical advice for more cautious implementation and interpretation of results.
Cross-lists for Fri, 20 Mar 26
- [37] arXiv:2603.18025 (cross-list from cs.CY) [pdf, ps, other]
-
Title: Understanding the Relationship Between Firms' AI Technology Innovation and Consumer Complaints
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
In the artificial intelligence (AI) age, firms increasingly invest in AI technology innovation to secure competitive advantages. However, the relationship between firms' AI technology innovation and consumer complaints remains insufficiently explored. Drawing on Protection Motivation Theory (PMT), this paper investigates how firms' AI technology innovation influences consumer complaints. Employing a multimethod approach, Study 1 analyzes panel data from S&P 500 firms (N = 2,758 firm-year observations), Study 2 examines user-generated Reddit data (N = 2,033,814 submissions and comments), and Study 3 involves two controlled experiments (N = 410 and N = 500). The results reveal that firms' AI technology innovation significantly increases consumers' threat-related emotions, heightening their complaints. Furthermore, compared to AI process innovation, AI product innovation leads to higher consumer complaints. This paper advances the understanding of consumers' psychological responses to firms' AI innovation and provides practical implications for managing consumer complaints effectively.
- [38] arXiv:2603.18032 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Towards Differentiating Between Failures and Domain Shifts in Industrial Data Streams
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Anomaly and failure detection methods are crucial in identifying deviations from normal system operational conditions, which allows for actions to be taken in advance, usually preventing more serious damage. Long-lasting deviations indicate failures, while sudden, isolated changes in the data indicate anomalies. However, in many practical applications, changes in the data do not always represent abnormal system states. Such changes may be incorrectly recognized as failures when they are in fact a normal evolution of the system, e.g. the characteristics of starting to process a new product, i.e. a domain shift. Therefore, distinguishing between failures and such "healthy" changes in data distribution is critical to ensure the practical robustness of the system. In this paper, we propose a method that not only detects changes in the data distribution and anomalies but also allows us to distinguish between failures and normal domain shifts inherent to a given process. The proposed method consists of a modified Page-Hinkley changepoint detector for identification of the domain shift and possible failures and supervised domain-adaptation-based algorithms for fast, online anomaly detection. These two are coupled with an explainable artificial intelligence (XAI) component that aims at helping the human operator to finally differentiate between domain shifts and failures. The method is illustrated by an experiment on a data stream from a steel factory.
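For readers unfamiliar with the changepoint component, the following is a minimal sketch of the standard one-sided Page-Hinkley test for an upward shift in the mean. The paper uses a modified detector whose details are not given in the abstract, so this shows only the textbook baseline; the function name and the default values of `delta` and `threshold` are assumptions.

```python
def page_hinkley(stream, delta=0.1, threshold=5.0):
    """Return the 0-based index of the first alarm for an upward mean
    shift, or None if no alarm is raised.

    delta:     tolerated drift magnitude (hypothetical default)
    threshold: alarm level for the test statistic (hypothetical default)
    """
    mean = 0.0        # running mean of the observations seen so far
    cumulative = 0.0  # cumulative deviation from the running mean
    minimum = 0.0     # smallest cumulative deviation seen so far
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t
        cumulative += x - mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return t - 1
    return None


# Toy stream: the level jumps from 0 to 5 at index 50.
shifted = [0.0] * 50 + [5.0] * 50
alarm_index = page_hinkley(shifted)
no_alarm = page_hinkley([1.0] * 100)
```

The detector only reports that the level changed; deciding whether the change is a failure or a benign domain shift is the job of the downstream components described in the abstract.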
- [39] arXiv:2603.18053 (cross-list from cs.SI) [pdf, ps, other]
-
Title: Auditing the Auditors: Does Community-based Moderation Get It Right?
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY); General Economics (econ.GN); Machine Learning (stat.ML)
Online social platforms increasingly rely on crowd-sourced systems to label misleading content at scale, but these systems must both aggregate users' evaluations and decide whose evaluations to trust. To address the latter, many platforms audit users by rewarding agreement with the final aggregate outcome, a design we term consensus-based auditing. We analyze the consequences of this design in X's Community Notes, which in September 2022 adopted consensus-based auditing that ties users' eligibility for participation to agreement with the eventual platform outcome. We find evidence of strategic conformity: minority contributors' evaluations drift toward the majority and their participation share falls on controversial topics, where independent signals matter most. We formalize this mechanism in a behavioral model in which contributors trade off private beliefs against anticipated penalties for disagreement. Motivated by these findings, we propose a two-stage auditing and aggregation algorithm that weights contributors by the stability of their past residuals rather than by agreement with the majority. The method first accounts for differences across content and contributors, and then measures how predictable each contributor's evaluations are relative to the latent-factor model. Contributors whose evaluations are consistently informative receive greater influence in aggregation, even when they disagree with the prevailing consensus. In the Community Notes data, this approach improves out-of-sample predictive performance while avoiding penalization of disagreement.
- [40] arXiv:2603.18074 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction
Authors: Yi Yu, Junzhuo Ma, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Guangquan Hu, Jianfeng Liu, Weiting Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP)
Adapting Large Language Models in complex technical service domains is constrained by the absence of explicit cognitive chains in human demonstrations and the inherent ambiguity arising from the diversity of valid responses. These limitations severely hinder agents from internalizing latent decision dynamics and generalizing effectively. Moreover, practical adaptation is often impeded by the prohibitive resource and time costs associated with standard training paradigms. To overcome these challenges and guarantee computational efficiency, we propose a lightweight adaptation framework comprising three key contributions. (1) Latent Logic Augmentation: We introduce Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation to bridge the gap between surface-level supervision and latent decision logic. These approaches strengthen the stability of Supervised Fine-Tuning alignment. (2) Robust Noise Reduction: We construct a Multiple Ground Truths dataset through a dual-filtering method to reduce the noise by validating diverse responses, thereby capturing the semantic diversity. (3) Lightweight Adaptation: We design a Hybrid Reward mechanism that fuses an LLM-based judge with a lightweight relevance-based Reranker to distill high-fidelity reward signals while reducing the computational cost compared to standard LLM-as-a-Judge reinforcement learning. Empirical evaluations on real-world Cloud service tasks, conducted across semantically diverse settings, demonstrate that our framework achieves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction. Concurrently, our Hybrid Reward mechanism achieves alignment comparable to standard LLM-as-a-judge methods with reduced training time, underscoring the practical value for deploying technical service agents.
- [41] arXiv:2603.18111 (cross-list from cs.LG) [pdf, ps, other]
-
Title: BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Contrastive learning methods for time series anomaly detection (TSAD) heavily depend on the quality of negative sample construction. However, existing strategies based on random perturbations or pseudo-anomaly injection often struggle to simultaneously preserve temporal semantic consistency and provide effective decision-boundary supervision. Most existing methods rely on prior anomaly injection, while overlooking the potential of generating hard negatives near the data manifold boundary directly from normal samples themselves. To address this issue, we propose a reconstruction-driven boundary negative generation framework that automatically constructs hard negatives through the reconstruction process of normal samples. Specifically, the method first employs a reconstruction network to capture normal temporal patterns, and then introduces a reinforcement learning strategy to adaptively adjust the optimization update magnitude according to the current reconstruction state. In this way, boundary-shifted samples close to the normal data manifold can be induced along the reconstruction trajectory and further used for subsequent contrastive representation learning. Unlike existing methods that depend on explicit anomaly injection, the proposed framework does not require predefined anomaly patterns, but instead mines more challenging boundary negatives from the model's own learning dynamics. Experimental results show that the proposed method effectively improves anomaly representation learning and achieves competitive detection performance on the current dataset.
- [42] arXiv:2603.18195 (cross-list from econ.GN) [pdf, ps, other]
-
Title: The Role of Data and Metrics in Measuring Inequality Worldwide. A Tribute to Giovanni Andrea Cornia's Lifelong Work on the World Ginis
Comments: 26 Pages, 5 Figures, 7 Tables
Subjects: General Economics (econ.GN); Applications (stat.AP)
This paper pays tribute to Professor Giovanni Andrea Cornia's lifelong contributions to the measurement of global inequality. We review twelve world and regional databases of the Gini coefficient, illustrate their coverage, overlap, and data gaps, and analyse the major sources of discrepancy among published Ginis. Merging all databases into a unified collection of over 122,000 observations spanning 222 countries from 1867 to 2024, we document how differences in welfare metrics, reference units, sub-metric definitions, post-survey adjustments, and survey design produce Gini estimates that diverge considerably -- sometimes by as much as 50 percentage points -- for the same country and year. We quantify pairwise cross-database discordance, document the income-consumption Gini gap by region and income group, and discuss the contributions of welfare metric and equivalence scale choices to cross-database dispersion. We extend the analysis with a dedicated discussion of comparability across time and across measurement dimensions, showing how multiple layers of methodological choice interact to make any single Gini figure a product of a complex chain of decisions that are rarely fully disclosed. Our analysis confirms that the choice of welfare metric remains the single most important source of cross-country non-comparability, while sub-metric definitions and equivalence scales introduce further systematic differences that are routinely overlooked in comparative work.
- [43] arXiv:2603.18201 (cross-list from cs.AI) [pdf, ps, other]
-
Title: A Computationally Efficient Learning of Artificial Intelligence System Reliability Considering Error Propagation
Comments: 42 pages, 11 figures
Subjects: Artificial Intelligence (cs.AI); Computation (stat.CO)
Artificial Intelligence (AI) systems are increasingly prominent in emerging smart cities, yet their reliability remains a critical concern. These systems typically operate through a sequence of interconnected functional stages, where upstream errors may propagate to downstream stages, ultimately affecting overall system reliability. Quantifying such error propagation is essential for accurate modeling of AI system reliability. However, this task is challenging due to: i) data availability: real-world AI system reliability data are often scarce and constrained by privacy concerns; ii) model validity: recurring error events across sequential stages are interdependent, violating the independence assumptions of statistical inference; and iii) computational complexity: AI systems process large volumes of high-speed data, resulting in frequent and complex recurrent error events that are difficult to track and analyze. To address these challenges, this paper leverages a physics-based autonomous vehicle simulation platform with a justifiable error injector to generate high-quality data for AI system reliability analysis. Building on this data, a new reliability modeling framework is developed to explicitly characterize error propagation across stages. Model parameters are estimated using a computationally efficient, theoretically guaranteed composite likelihood expectation-maximization algorithm. Its application to the reliability modeling for autonomous vehicle perception systems demonstrates its predictive accuracy and computational efficiency.
- [44] arXiv:2603.18254 (cross-list from cs.DS) [pdf, ps, other]
-
Title: Computation-Utility-Privacy Tradeoffs in Bayesian Estimation
Comments: To appear at STOC 2026
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Bayesian methods lie at the heart of modern data science and provide a powerful scaffolding for estimation in data-constrained settings and principled quantification and propagation of uncertainty. Yet in many real-world use cases where these methods are deployed, there is a natural need to preserve the privacy of the individuals whose data is being scrutinized. While a number of works have attempted to approach the problem of differentially private Bayesian estimation through either reasoning about the inherent privacy of the posterior distribution or privatizing off-the-shelf Bayesian methods, these works generally do not come with rigorous utility guarantees beyond low-dimensional settings. In fact, even for the prototypical tasks of Gaussian mean estimation and linear regression, it was unknown how close one could get to the Bayes-optimal error with a private algorithm, even in the simplest case where the unknown parameter comes from a Gaussian prior. In this work, we give the first efficient algorithms for both of these problems that achieve mean-squared error $(1+o(1))\mathrm{OPT}$ and additionally show that both tasks exhibit an intriguing computational-statistical gap. For Bayesian mean estimation, we prove that the excess risk achieved by our method is optimal among all efficient algorithms within the low-degree framework, yet is provably worse than what is achievable by an exponential-time algorithm. For linear regression, we prove a qualitatively similar lower bound. Our algorithms draw upon the privacy-to-robustness framework of arXiv:2212.05015, but with the curious twist that to achieve private Bayes-optimal estimation, we need to design sum-of-squares-based robust estimators for inherently non-robust objects like the empirical mean and OLS estimator. Along the way we also add to the sum-of-squares toolkit a new kind of constraint based on short-flat decompositions.
- [45] arXiv:2603.18325 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
Authors: Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy
Comments: 39 pages, 4 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
- [46] arXiv:2603.18391 (cross-list from cs.DS) [pdf, ps, other]
-
Title: Computational and Statistical Hardness of Calibration Distance
Authors: Mingda Qiao
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
The distance from calibration, introduced by Błasiok, Gopalan, Hu, and Nakkiran (STOC 2023), has recently emerged as a central measure of miscalibration for probabilistic predictors. We study the fundamental problems of computing and estimating this quantity, given either an exact description of the data distribution or only sample access to it.
We give an efficient algorithm that exactly computes the calibration distance when the distribution has a uniform marginal and noiseless labels, which improves the $O(1/\sqrt{|\mathcal{X}|})$ additive approximation of Qiao and Zheng (COLT 2024) for this special case. Perhaps surprisingly, the problem becomes $\mathsf{NP}$-hard when either of the two assumptions is removed. We extend our algorithm to a polynomial-time approximation scheme for the general case.
For the estimation problem, we show that $\Theta(1/\epsilon^3)$ samples are sufficient and necessary for the empirical calibration distance to be upper bounded by the true distance plus $\epsilon$. In contrast, a polynomial dependence on the domain size -- incurred by the learning-based baseline -- is unavoidable for two-sided estimation.
Our positive results are based on simple sparsifications of both the distribution and the target predictor, which significantly reduce the search space for computation and lead to stronger concentration for the estimation problem. To prove the hardness results, we introduce new techniques for certifying lower bounds on the calibration distance -- a problem that is hard in general due to its $\textsf{co-NP}$-completeness.
- [47] arXiv:2603.18482 (cross-list from cs.CL) [pdf, ps, other]
-
Title: The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices
Comments: Under review
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.
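The truncation effect described in this abstract can be illustrated with a minimal sketch. It is not the paper's code: the helper names `top_k_set` and `nucleus_set` and the toy next-token distribution are hypothetical, and the nucleus rule shown is the common "smallest prefix whose cumulative probability reaches top_p" formulation.

```python
def top_k_set(probs, k):
    """Indices of the k most probable tokens (ties broken by lower index)."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return set(order[:k])


def nucleus_set(probs, top_p):
    """Smallest set of most probable tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


# Toy next-token distribution over a 4-token vocabulary (made-up numbers).
probs = [0.5, 0.3, 0.15, 0.05]
human_choice = 3  # a rare but contextually appropriate token

reachable_top_k = top_k_set(probs, k=2)
reachable_nucleus = nucleus_set(probs, top_p=0.9)

# A token outside the truncated set can never be sampled by that decoder.
print(human_choice in reachable_top_k, human_choice in reachable_nucleus)
```

In this toy case the token at index 3 falls outside both truncated sets, so neither decoder can emit it, regardless of how appropriate it is in context.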
- [48] arXiv:2603.18538 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Decentralized Federated Learning (DFL) remains highly vulnerable to adaptive backdoor attacks designed to bypass traditional passive defense metrics. To address this limitation, we shift the defensive paradigm toward a novel active, interventional auditing framework. First, we establish a dynamical model to characterize the spatiotemporal diffusion of adversarial updates across complex graph topologies. Second, we introduce a suite of proactive auditing metrics: stochastic entropy anomaly, randomized smoothing Kullback-Leibler divergence, and activation kurtosis. These metrics utilize private probes to stress-test local models, effectively exposing latent backdoors that remain invisible to conventional static detection. Furthermore, we implement a topology-aware defense placement strategy to maximize global aggregation resilience. We establish theoretical convergence properties of the system under co-evolving attack and defense dynamics. Numerical evaluations across diverse architectures demonstrate that our active framework is highly competitive with state-of-the-art defenses in mitigating stealthy, adaptive backdoors while preserving primary task utility.
- [49] arXiv:2603.18706 (cross-list from math.OC) [pdf, other]
-
Title: A mathematical framework for time-delay reservoir computing analysis
Authors: Anh-Tuan Clabaut (L2S), Jean Auriol (L2S), Islam Boussaada (L2S, DISCO, IPSA), Guilherme Mazanti (DISCO, L2S)
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)
Reservoir computing is a well-established approach for processing data with a much lower complexity compared to traditional neural networks. Despite two decades of experimental progress, the core properties of reservoir computing (namely separation, robustness, and fading memory) still lack rigorous mathematical foundations. This paper addresses this gap by providing a control-theoretic framework for the analysis of time-delay-based reservoir computers. We introduce formal definitions of the separation property and fading memory in terms of functional norms, and establish their connection to well-known stability notions for time-delay systems, such as incremental input-to-state stability. For a class of linear reservoirs, we derive an explicit lower bound for the separation distance via Fourier analysis, offering a computable criterion for reservoir design. Numerical results on the NARMA10 benchmark and continuous-time system prediction validate the approach with a minimal digital implementation.
- [50] arXiv:2603.18736 (cross-list from cs.LG) [pdf, ps, other]
-
Title: CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
Authors: Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li, Yuan Lu, Xinggao Liu, Haoxuan Li, Zhouchen Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, which causes it to deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
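The propensity-score reweighting mentioned in this abstract can be sketched with a generic inverse-propensity-weighted (IPW) loss. This is a standard textbook construction, not CausalRM's actual loss: the function name `ipw_average_loss`, the clipping threshold `min_propensity`, and the toy numbers are assumptions made for illustration.

```python
def ipw_average_loss(losses, observed, propensities, min_propensity=0.05):
    """Inverse-propensity-weighted average loss over all n responses.

    losses[i]       : loss on response i (used only when observed[i] is True)
    observed[i]     : whether user feedback was logged for response i
    propensities[i] : estimated probability that feedback is logged for i
    """
    if not (len(losses) == len(observed) == len(propensities)):
        raise ValueError("inputs must have equal length")
    total = 0.0
    for loss, seen, p in zip(losses, observed, propensities):
        if seen:
            # Up-weight responses that were unlikely to receive feedback.
            # Clipping keeps weights bounded at the price of some bias.
            total += loss / max(p, min_propensity)
    return total / len(losses)


# Toy data (hypothetical): response 1 received no feedback.
losses = [1.0, 2.0, 3.0, 4.0]
observed = [True, False, True, True]
propensities = [0.5, 0.5, 1.0, 1.0]

naive = sum(l for l, s in zip(losses, observed) if s) / sum(observed)
weighted = ipw_average_loss(losses, observed, propensities)
print(naive, weighted)
```

With correctly specified propensities and no clipping, each term `loss / p` has expectation equal to the loss itself, so the weighted average is an unbiased estimate of the average loss over all responses, whereas the naive average only reflects responses that happened to receive feedback.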
- [51] arXiv:2603.18838 (cross-list from cs.LG) [pdf, ps, other]
-
Title: A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Striking an optimal balance between predictive performance and fairness continues to be a fundamental challenge in machine learning. In this work, we propose a post-processing framework that facilitates fairness-aware prediction by leveraging model ensembling. Designed to operate independently of any specific model internals, our approach is widely applicable across various learning tasks, model architectures, and fairness definitions. Through extensive experiments spanning classification, regression, and survival analysis, we demonstrate that the framework effectively enhances fairness while maintaining, or only minimally affecting, predictive accuracy.
- [52] arXiv:2603.18846 (cross-list from cs.CV) [pdf, ps, other]
-
Title: Towards Interpretable Foundation Models for Retinal Fundus Images
Comments: 11 pages, 3 figures, 2 tables, submitted to MICCAI 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO)
Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model's representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to $16\times$ the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.
- [53] arXiv:2603.18870 (cross-list from econ.EM) [pdf, ps, other]
-
Title: Inference in Regression Discontinuity Designs with Clustered Data
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
Clustered sampling is prevalent in empirical regression discontinuity (RD) designs, but it has not received much attention in the theoretical literature. In this paper, we introduce a general model-based framework for such settings and derive high-level conditions under which the standard local linear RD estimator is asymptotically normal. We verify that our high-level assumptions hold across a wide range of empirical designs, including settings of growing cluster sizes. We further show that clustered standard errors that are currently used in practice can be either inconsistent or overly conservative in finite samples. To address these issues, we propose a novel nearest-neighbor-type variance estimator and illustrate its properties in a diverse set of empirical applications.
- [54] arXiv:2603.18957 (cross-list from cs.LG) [pdf, ps, other]
-
Title: BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Recent advances in drug discovery have demonstrated that incorporating side information (e.g., chemical properties about drugs and genomic information about diseases) often greatly improves prediction performance. However, these side features can vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that enables variable selection from side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate our method through simulation studies and two drug discovery applications: 1) prediction of drug resistance in Mycobacterium tuberculosis, and 2) prediction of new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in terms of prediction. In our two real examples, BVSIMC further reveals the most clinically meaningful side features.
- [55] arXiv:2603.18965 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Maximum-Entropy Exploration with Future State-Action Visitation Measures
Comments: arXiv admin note: substantial text overlap with arXiv:2412.06655
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
- [56] arXiv:2603.19005 (cross-list from cs.LG) [pdf, ps, other]
-
Title: AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
Authors: An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS
- [57] arXiv:2603.19061 (cross-list from cs.CG) [pdf, ps, other]
-
Title: Hardness of High-Dimensional Linear Classification
Comments: SoCG 2026
Subjects: Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
We establish new exponential-in-dimension lower bounds for the Maximum Halfspace Discrepancy problem, which models linear classification. Both are fundamental problems in computational geometry and machine learning in their exact and approximate forms. However, only $O(n^d)$ and $\tilde O(1/\varepsilon^d)$ upper bounds, respectively, are known, and they are complemented by polynomial lower bounds that do not support the exponential-in-dimension dependence. We close this gap up to polylogarithmic terms by reduction from widely-believed hardness conjectures for Affine Degeneracy testing and $k$-Sum problems. Our reductions yield matching lower bounds of $\tilde\Omega(n^d)$ and $\tilde\Omega(1/\varepsilon^d)$, respectively, based on Affine Degeneracy testing, and $\tilde\Omega(n^{d/2})$ and $\tilde\Omega(1/\varepsilon^{d/2})$, respectively, conditioned on $k$-Sum. The first bound also holds unconditionally if the computational model is restricted to make sidedness queries, which corresponds to a widespread setting implemented and optimized in many contemporary algorithms and computing paradigms.
- [58] arXiv:2603.19108 (cross-list from math.NA) [pdf, ps, other]
-
Title: Numerical Considerations for the Construction of Karhunen-Loève Expansions
Subjects: Numerical Analysis (math.NA); Machine Learning (stat.ML)
This report examines numerical aspects of constructing Karhunen-Loève expansions (KLEs) for second-order stochastic processes. The KLE relies on the spectral decomposition of the covariance operator via the Fredholm integral equation of the second kind, which is then discretized on a computational grid, leading to an eigendecomposition task. We derive the algebraic equivalence between this Fredholm-based eigensolution and the singular value decomposition of the weight-scaled sample matrix, yielding consistent solutions for both model-based and data-driven KLE construction. Analytical eigensolutions for exponential and squared-exponential covariance kernels serve as reference benchmarks to assess numerical consistency and accuracy in 1D settings. The convergence of SVD-based eigenvalue estimates and of the empirical distributions of the KL coefficients to their theoretical $\mathcal{N}(0,1)$ target are characterized as a function of sample count. Higher-dimensional configurations include a two-dimensional irregular domain discretized by unstructured triangular meshes with two refinement levels, and a three-dimensional toroidal domain whose non-simply-connected topology motivates a comparison between Euclidean and shortest interior path distances between the grid points. The numerical results highlight the interplay between the discretization strategy, quadrature rule, and sample count, and their impact on the KLE results.
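The equivalence between the discretized Fredholm eigenproblem and an SVD, as described in this abstract, can be sketched as follows. The notation is assumed here rather than taken from the report: $X \in \mathbb{R}^{n \times N}$ is the centered sample matrix ($n$ grid points, $N$ samples), $W = \mathrm{diag}(w_1, \dots, w_n)$ holds positive quadrature weights, and the sample covariance uses the $1/(N-1)$ normalization.

```latex
\begin{align*}
% Continuous eigenproblem for the covariance kernel C on the domain D
\int_D C(x, y)\, \varphi_k(y)\, dy &= \lambda_k\, \varphi_k(x) \\
% Quadrature discretization on the n grid points
\mathbf{C} W \boldsymbol{\varphi}_k &= \lambda_k \boldsymbol{\varphi}_k \\
% Symmetrization with \psi_k = W^{1/2} \varphi_k
W^{1/2} \mathbf{C} W^{1/2} \boldsymbol{\psi}_k &= \lambda_k \boldsymbol{\psi}_k \\
% Data-driven case: replace C by the sample covariance
\mathbf{C} \approx \tfrac{1}{N-1} X X^{\top}
\;\Longrightarrow\;
W^{1/2} \mathbf{C} W^{1/2} &\approx \tfrac{1}{N-1}
  \bigl(W^{1/2} X\bigr) \bigl(W^{1/2} X\bigr)^{\top} \\
% Thin SVD of the weight-scaled sample matrix
W^{1/2} X = U \Sigma V^{\top}
\;\Longrightarrow\;
\lambda_k = \frac{\sigma_k^2}{N-1}, \qquad
\boldsymbol{\varphi}_k &= W^{-1/2} \mathbf{u}_k \\
% KL coefficient of sample j for mode k
\xi_k^{(j)} = \frac{\boldsymbol{\varphi}_k^{\top} W \mathbf{x}_j}{\sqrt{\lambda_k}}
&= \sqrt{N-1}\; V_{jk}
\end{align*}
```

Under these assumptions the eigenvectors satisfy the weighted orthonormality $\boldsymbol{\varphi}_k^{\top} W \boldsymbol{\varphi}_l = \delta_{kl}$, and the coefficients $\xi_k^{(j)}$ have zero sample mean and unit sample variance for every mode with $\sigma_k > 0$, which is consistent with the standardized target mentioned in the abstract.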
Replacements for Fri, 20 Mar 26
- [59] arXiv:2107.08686 (replaced) [pdf, ps, other]
-
Title: Improved Learning Rates for Stochastic Optimization
Comments: This version substantially revises and supersedes all previous versions. Earlier versions contained errors and should not be relied upon for the current results or statements. The manuscript has been thoroughly rewritten, with a narrowed scope, a simplified presentation, a revised focus, and corresponding updates to the title and main claims. Please refer to and cite the current version
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [60] arXiv:2209.04892 (replaced) [pdf, ps, other]
-
Title: "Calibeating": Beating Forecasters at Their Own Game
Comments: Corrected Appendix A.7 + new Appendix A.10. Included: Addendum and Errata to the published journal version (Theoretical Economics, 2023) and to arXiv previous version v2 (2022). Web page: this http URL
Journal-ref: Theoretical Economics 18 (2023), 4, 1441-1474
Subjects: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
- [61] arXiv:2307.12544 (replaced) [pdf, ps, other]
-
Title: Adaptive debiased machine learning using data-driven model selection techniques
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
- [62] arXiv:2309.08945 (replaced) [pdf, ps, other]
-
Title: Inverse classification with logistic and softmax classifiers: efficient optimization
Comments: Appears in Transactions on Machine Learning Research, March 2026
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [63] arXiv:2312.03871 (replaced) [pdf, ps, other]
-
Title: Hidden yet quantifiable: A lower bound for confounding strength using randomized trials
Comments: Accepted for presentation at the International Conference on Artificial Intelligence and Statistics (AISTATS) 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [64] arXiv:2312.08531 (replaced) [pdf, ps, other]
-
Title: Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods
Comments: The preliminary version has been accepted at ICLR 2024. For the update history, please refer to the PDF
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [65] arXiv:2402.01972 (replaced) [pdf, ps, other]
-
Title: Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
- [66] arXiv:2403.07189 (replaced) [pdf, ps, other]
-
Title: A multiscale cavity method for sublinear-rank symmetric matrix factorization
Comments: 65 pages. Filled out proof details, improved multiscale cavity method and its proof. Equation and theorem numbering made consistent with published version
Subjects: Information Theory (cs.IT); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mathematical Physics (math-ph); Statistics Theory (math.ST)
- [67] arXiv:2404.00256 (replaced) [pdf, ps, other]
-
Title: Robust Bayesian modeling for Preprocessing Large-Scale Data
Authors: Yoshiko Hayashi
Subjects: Methodology (stat.ME)
- [68] arXiv:2411.04380 (replaced) [pdf, ps, other]
-
Title: Identification of Long-Term Treatment Effects via Temporal Links, Observational, and Experimental Data
Authors: Filip Obradović
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
- [69] arXiv:2412.02484 (replaced) [pdf, ps, other]
-
Title: Vector Optimization with Gaussian Process Bandits
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
- [70] arXiv:2501.00744 (replaced) [pdf, ps, other]
-
Title: Assessing the Distributional Fidelity of Synthetic Chest X-rays using the Embedded Characteristic Score
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [71] arXiv:2501.02364 (replaced) [pdf, ps, other]
-
Title: Linearly Separable Features in Shallow Nonlinear Networks: Width Scales Polynomially with Intrinsic Data Dimension
Comments: 33 pages, 10 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
- [72] arXiv:2502.05322 (replaced) [pdf, ps, other]
-
Title: Tropical Fréchet Means: a polyhedral approach to exact optimization
Comments: 26 pages. 8 figures. v3: Added Section 5. Extended version as to appear in the special issue for the International Symposium on Symbolic and Algebraic Computation ISSAC 2025
Journal-ref: Journal of Symbolic Computation (2026) 102572
Subjects: Optimization and Control (math.OC); Combinatorics (math.CO); Metric Geometry (math.MG); Statistics Theory (math.ST)
- [73] arXiv:2502.08416 (replaced) [pdf, ps, other]
-
Title: Multifidelity Simulation-based Inference for Computationally Expensive Simulators
Authors: Anastasia N. Krouglova, Hayden R. Johnson, Basile Confavreux, Michael Deistler, Pedro J. Gonçalves
Comments: Accepted at ICLR 2026. Available at OpenReview: this https URL
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [74] arXiv:2504.09573 (replaced) [pdf, ps, other]
-
Title: A grid-based methodology for fast online changepoint detection
Authors: Per August Jarval Moen
Subjects: Methodology (stat.ME)
- [75] arXiv:2505.22659 (replaced) [pdf, ps, other]
-
Title: A General Marked Point Process Framework For Self-Exciting Network Evolution
Subjects: Methodology (stat.ME)
- [76] arXiv:2506.10586 (replaced) [pdf, ps, other]
-
Title: Size-adaptive Hypothesis Testing for Fairness
Journal-ref: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
- [77] arXiv:2507.04303 (replaced) [pdf, ps, other]
-
Title: Forecasting age distribution of deaths across countries: Life expectancy and annuity valuation
Comments: 34 pages, 15 figures, 5 tables
Subjects: Applications (stat.AP)
- [78] arXiv:2507.04668 (replaced) [pdf, ps, other]
-
Title: Forward Regression via Gram-Schmidt Orthogonalization for Ultra-High Dimensional Linear Models
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
- [79] arXiv:2507.06542 (replaced) [pdf, ps, other]
-
Title: On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Comments: We discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of federated learning, even in highly heterogeneous and communication-constrained environments
Journal-ref: ICLR 2026 (Oral Presentation)
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
- [80] arXiv:2507.08193 (replaced) [pdf, ps, other]
-
Title: Entity-Specific Cyber Risk Assessment using InsurTech Empowered Risk Factors
Comments: Variance 19 (February)
Subjects: Risk Management (q-fin.RM); Machine Learning (cs.LG); Machine Learning (stat.ML)
- [81] arXiv:2507.23240 (replaced) [pdf, ps, other]
-
Title: A-optimal Designs under Generalized Linear Models
Comments: 34 pages, 2 figures, 9 tables
Subjects: Methodology (stat.ME); Computation (stat.CO)
- [82] arXiv:2508.07473 (replaced) [pdf, ps, other]
-
Title: Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications
Authors: Zijian Liu
Comments: A short, self-contained version has been accepted at ALT 2026. Update to include the change in the camera-ready version
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
- [83] arXiv:2508.19949 (replaced) [pdf, ps, other]
-
Title: Estimating non-linear functionals of trawl processes
Authors: Orimar Sauri
Subjects: Probability (math.PR); Statistics Theory (math.ST)
- [84] arXiv:2509.21181 (replaced) [pdf, ps, other]
-
Title: Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
- [85] arXiv:2509.22459 (replaced) [pdf, ps, other]
-
Title: Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
Authors: Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [86] arXiv:2510.04265 (replaced) [pdf, ps, other]
-
Title: Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Comments: OpenReview (ICLR 2026): this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
- [87] arXiv:2510.20012 (replaced) [pdf, ps, other]
-
Title: AI Pose Analysis and Kinematic Profiling of Range-of-Motion Variations in Resistance Training
Authors: Adam Diamant
Subjects: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
- [88] arXiv:2510.20742 (replaced) [pdf, ps, other]
-
Title: Bayesian Prediction under Moment Conditioning
Comments: Fixed typos, updated references, minor notational clarifications added
Subjects: Statistics Theory (math.ST)
- [89] arXiv:2511.17928 (replaced) [pdf, ps, other]
-
Title: Limit Theorems for Network Data without Metric Structure
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST)
- [90] arXiv:2512.20305 (replaced) [pdf, ps, other]
-
Title: A Structured Nonparametric Framework for Nonlinear Accelerated Failure Time Models (KAN-AFT)
Comments: A new development in Survival Analysis based on the celebrated Kolmogorov-Arnold Networks (KANs)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [91] arXiv:2512.23178 (replaced) [pdf, ps, other]
-
Title: Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis
Authors: Zijian Liu
Comments: A preliminary conference version is accepted at ICLR 2026. This full version includes the formal statements of lower bounds and their proofs. Moreover, the upper bounds are slightly improved
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
- [92] arXiv:2602.06175 (replaced) [pdf, ps, other]
-
Title: Optimal rates for density and mode estimation with expand-and-sparsify representations
Comments: Accepted at AISTATS 2026
Subjects: Statistics Theory (math.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
- [93] arXiv:2602.07684 (replaced) [pdf, ps, other]
-
Title: Quantifying resilience for distribution system customers with SALEDI
Subjects: Systems and Control (eess.SY); Applications (stat.AP)
- [94] arXiv:2602.18328 (replaced) [pdf, ps, other]
-
Title: Smoothness and other hyperparameter estimation for inverse problems related to data assimilation
Comments: 28 pages, 11 figures
Subjects: Computation (stat.CO)
- [95] arXiv:2603.04172 (replaced) [pdf, ps, other]
-
Title: The Pivotal Information Criterion
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Methodology (stat.ME)
- [96] arXiv:2603.07313 (replaced) [pdf, ps, other]
-
Title: Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Authors: Angad Singh Ahuja
Comments: 30 pages, 8 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- [97] arXiv:2603.12422 (replaced) [pdf, ps, other]
-
Title: Mortgage Burnout and Selection Effects in Heterogeneous Cox Hazard Models
Authors: Andrew Lesniewski
Comments: 8 pages. Added a subsection on the Cox model
Subjects: Mathematical Finance (q-fin.MF); General Economics (econ.GN); Methodology (stat.ME)
- [98] arXiv:2603.14324 (replaced) [pdf, ps, other]
-
Title: Learning-to-Defer with Expert-Conditioned Advice
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
- [99] arXiv:2603.14431 (replaced) [pdf, ps, other]
-
Title: Deviation Tests for a High-dimensional Mean
Subjects: Methodology (stat.ME)
- [100] arXiv:2603.14561 (replaced) [pdf, ps, other]
-
Title: Refined Inference for Asymptotically Linear Estimators with Non-Negligible Second-Order Remainders
Comments: 32 pages, 3 tables, 1 supplement
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
- [101] arXiv:2603.14601 (replaced) [pdf, ps, other]
-
Title: $K-$means with learned metrics
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR)
- [102] arXiv:2603.16982 (replaced) [pdf, ps, other]
-
Title: Trajectory Stability and Signature Diagnostics for Comet-Based Interstellar Navigation
Authors: Bo Pieter Johannes Andrée
Comments: 31 pages, 2 figures
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Dynamical Systems (math.DS); Applications (stat.AP)
- [103] arXiv:2603.17381 (replaced) [pdf, ps, other]
-
Title: An Auditable AI Agent Loop for Empirical Economics: A Case Study in Forecast Combination
Authors: Minchul Shin
Comments: 32 pages, no figure
Subjects: Econometrics (econ.EM); Machine Learning (stat.ML)