Sample Amplification: Increasing Dataset Size even when Learning is Impossible

Authors: Brian Axelrod, Shivam Garg, Vatsal Sharan, Gregory Valiant

ICML 2020

Reproducibility Variable Result LLM Response
Research Type Experimental Figure 2: Performance improvement by amplification; see Example 2 for a description of the setup. Example 2. Given labeled examples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from a distribution $D$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, a natural quantity to estimate is the variance in $y$ not explainable as a linear function of $x$: $\inf_{\theta \in \mathbb{R}^d} E_{(x,y) \sim D}[(\theta^T x - y)^2]$. The standard unbiased estimator for this quantity is the training error of the least-squares linear model, scaled by a factor of $n/(n-d)$. This scaling factor makes the estimate unbiased, although its variance is large when $n$ is not much larger than $d$. Figure 2 shows the expected squared error of this estimator on raw samples, and on $(n, n+2)$ amplified samples, in the case where $x_i \sim N(0, I_d)$ and $y_i = \theta^T x_i + \eta$ for some model with $\|\theta\|_2 = 1$ and independent noise $\eta \sim N(0, 1/4)$; hence the true value of the unexplainable variance is $1/4$.
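The estimator described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the setup in Example 2, not code from the paper; the function name `unexplained_variance_estimate` and the chosen $n$, $d$ are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def unexplained_variance_estimate(X, y):
    """Unbiased estimate of the unexplainable variance: the least-squares
    training error, scaled by n / (n - d)."""
    n, d = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta_hat
    # Training SSE / n, scaled by n / (n - d), simplifies to SSE / (n - d).
    return residuals @ residuals / (n - d)

# Synthetic data matching the toy setup:
# x ~ N(0, I_d), y = theta^T x + eta with eta ~ N(0, 1/4).
n, d = 10_000, 20
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)           # ||theta||_2 = 1
X = rng.standard_normal((n, d))
y = X @ theta + rng.normal(0, 0.5, n)    # noise variance 1/4

est = unexplained_variance_estimate(X, y)
print(est)  # close to the true unexplainable variance, 1/4
```

With $n$ this much larger than $d$ the estimate concentrates tightly around $1/4$; the variance blow-up the paper discusses appears when $n$ is close to $d$.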
Researcher Affiliation Academia 1Stanford University, United States. Correspondence to: Brian Axelrod <baxelrod@stanford.edu>, Shivam Garg <shivamgarg@stanford.edu>, Vatsal Sharan <vsharan@cs.stanford.edu>, Gregory Valiant <valiant@stanford.edu>.
Pseudocode Yes Algorithm 1 Sample Amplification for Gaussian with Unknown Mean and Fixed Covariance
Open Source Code No The paper does not contain any statement or link indicating the public release of source code for the described methodology.
Open Datasets No The paper describes a 'toy example' where data is generated synthetically ('$x_i \sim N(0, I_d)$, and $y_i = \theta^T x_i + \eta$ for some model with $\|\theta\|_2 = 1$ and independent noise $\eta \sim N(0, 1/4)$'). It does not use or provide access information for a pre-existing publicly available dataset.
Dataset Splits No The paper describes a simulated 'toy example' for demonstration purposes and does not specify training, validation, or test dataset splits.
Hardware Specification No The paper describes theoretical results and a simulated 'toy example' but does not specify any hardware details (e.g., GPU/CPU models, memory) used for running experiments or simulations.
Software Dependencies No The paper does not list any specific software dependencies or their version numbers required to reproduce the experimental setup, focusing instead on theoretical contributions and a conceptual example.
Experiment Setup Yes Example 2. Given labeled examples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from a distribution $D$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$...in the case where $x_i \sim N(0, I_d)$, and $y_i = \theta^T x_i + \eta$ for some model with $\|\theta\|_2 = 1$ and independent noise $\eta \sim N(0, 1/4)$...Here, the amplification procedure draws two additional datapoints $x$ from the isotropic Gaussian with mean equal to the empirical mean, and labels them according to the learned least-squares regression model $\hat{\theta}$ with independent noise of variance $5/n$ times the empirical estimate of the unexplained variance.
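The amplification procedure quoted above can be sketched as follows. This is a hedged reconstruction of the described steps, not the authors' code: the function name `amplify`, the choice of seed, and the problem sizes are assumptions, and the noise variance $(5/n) \cdot \hat{\sigma}^2$ is taken directly from the quoted description.

```python
import numpy as np

rng = np.random.default_rng(1)

def amplify(X, y, k=2):
    """Sketch of the described (n, n+k) amplification procedure: draw k
    extra covariates from an isotropic Gaussian centred at the empirical
    mean, and label them with the least-squares model plus noise of
    variance (5/n) times the estimated unexplained variance."""
    n, d = X.shape
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta_hat
    sigma2_hat = residuals @ residuals / (n - d)   # unexplained variance
    X_new = rng.standard_normal((k, d)) + X.mean(axis=0)
    y_new = X_new @ theta_hat + rng.normal(0, np.sqrt(5 / n * sigma2_hat), k)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Usage on the toy setup from Example 2 (sizes chosen for illustration):
n, d = 1_000, 100
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
X = rng.standard_normal((n, d))
y = X @ theta + rng.normal(0, 0.5, n)
X_amp, y_amp = amplify(X, y)
print(X_amp.shape, y_amp.shape)  # (1002, 100) (1002,)
```

The amplified set mixes the n genuine samples with 2 synthetic ones, matching the $(n, n+2)$ amplification evaluated in Figure 2.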