Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Beyond Least Squares: Uniform Approximation and the Hidden Cost of Misspecification

Authors: Davide Maran, Csaba Szepesvari

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Beyond Least Squares: Uniform Approximation and the Hidden Cost of Misspecification Davide Maran Politecnico di Milano EMAIL Csaba Szepesvári Google Deep Mind and University of Alberta EMAIL We study the problem of controlling worst-case errors in misspecified linear regression under the random design setting, where the regression function is estimated via (penalized) least-squares. This setting arises naturally in value function approximation for bandit algorithms and reinforcement learning (RL). Our first main contribution is the observation that the amplification of the misspecification error when using least-squares is governed by the Lebesgue constant, a classical quantity from approximation theory that depends on the choice of the feature subspace and the covariate distribution. We also show that this dependence on the misspecification error is tight for least-squares regression: in general, no method minimizing the empirical squared loss, including regularized least-squares, can improve it substantially. We argue this explains the empirical observation that some feature-maps (e.g., those derived from the Fourier bases) work better in RL than others (e.g., polynomials): given some covariate distribution, the Lebesgue constant is known to be highly sensitive to choice of the feature-map. As a second contribution, we propose a method that augments the original feature set with auxiliary features designed to reduce the error amplification. We then prove that the method successfully competes with an oracle that knows the best way of using the auxiliary features to reduce this amplification. For example, when the domain is a real interval and the features are monomials, our method reduces the amplification factor to O(1) as d , while without our method, least-squares with the monomials (and in fact polynomials) will suffer a worst-case error amplification of order Ω(d). It follows that there are functions and feature maps for which our method is consistent, while least-squares is inconsistent. Figure 1: Comparison between the OLS estimator and the BWR estimator using polynomial features on [ 1, 1], with d = 10 features (left) and d = 15 features (right). The inputs are chosen uniformly at random from [ 1, 1]. Even if the true function is bounded, OLS suffers from large oscillations near the boundaries due to the high Lebesgue constant. In contrast, BWR achieves a much more uniform approximation error across the domain by effectively controlling the amplification effect.
Researcher Affiliation Collaboration Davide Maran Politecnico di Milano EMAIL Csaba Szepesvári Google Deep Mind and University of Alberta EMAIL
Pseudocode Yes Algorithm 1 Subgradient Method Require: Feature map φD, d, Number I of iterations Ensure: Sequence α AD d 1: Compute bφD from φD via Gram-Schmidt orthogonalization 2: Define the following loss: J(α) = M(α) 3: Initialize α(0) [ones(d), zeros(D d)] 4: for ℓ= 1 to I do 5: Compute step size γℓ= D d bφ2,d ℓ+1 6: Compute a subgradient gℓ J(α(ℓ 1)) 7: Update: α(ℓ) = α(ℓ 1) γℓgℓ 8: if α(ℓ) / AD d then 9: Project: h(ℓ) = ΠHD d α(ℓ) 10: end if 11: end for 12: return α = α(I)
Open Source Code Yes Answer: [Yes] Justification: We include the code in the supplementary material (very simple, just one very short Jupyter notebook)
Open Datasets No The paper describes a case study where "X = [ -1, 1] and the data-generating distribution µ is uniform on this interval." This indicates a synthetic data generation process for simulation rather than the use of a named, publicly available dataset with concrete access information.
Dataset Splits No The paper mentions that "The inputs are chosen uniformly at random from [ -1, 1]" for the simulations. This describes a data generation strategy for a case study but does not specify explicit training, validation, or test splits of a predefined dataset.
Hardware Specification No Answer: [NA] Justification: The simulation is straightforward and its computational time is negligible.
Software Dependencies No The paper states that "We include the code in the supplementary material (very simple, just one very short Jupyter notebook)". While this implies the use of Python, no specific versions of Python or any other libraries or solvers are mentioned, which are required for a reproducible description of software dependencies.
Experiment Setup Yes Figure 1: Comparison between the OLS estimator and the BWR estimator using polynomial features on [ 1, 1], with d = 10 features (left) and d = 15 features (right). The inputs are chosen uniformly at random from [ 1, 1].