Sample Amplification: Increasing Dataset Size even when Learning is Impossible
Authors: Brian Axelrod, Shivam Garg, Vatsal Sharan, Gregory Valiant
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 2: Performance improvement by amplification (see Example 2 for a description of the setup). Example 2: Given labeled examples (x_1, y_1), ..., (x_n, y_n) drawn from a distribution D, with x_i ∈ R^d and y_i ∈ R, a natural quantity to estimate is the fraction of variance in y not explainable as a linear function of x: inf_{θ ∈ R^d} E_{(x,y)∼D}[(θ^T x − y)^2]. The standard unbiased estimator for this quantity is the training error of the least-squares linear model, scaled by a factor of n/(n − d). This scaling makes the estimate unbiased, although its variance is large when n is not much larger than d. Figure 2 shows the expected squared error of this estimator on raw samples, and on (n, n + 2) amplified samples, in the case where x_i ∼ N(0, I_d) and y_i = θ^T x_i + η for some model with ‖θ‖_2 = 1 and independent noise η ∼ N(0, 1/4); hence the true value of the unexplainable variance is 1/4. |
| Researcher Affiliation | Academia | 1Stanford University, United States. Correspondence to: Brian Axelrod <baxelrod@stanford.edu>, Shivam Garg <shivamgarg@stanford.edu>, Vatsal Sharan <vsharan@cs.stanford.edu>, Gregory Valiant <valiant@stanford.edu>. |
| Pseudocode | Yes | Algorithm 1 Sample Amplification for Gaussian with Unknown Mean and Fixed Covariance |
| Open Source Code | No | The paper does not contain any statement or link indicating the public release of source code for the described methodology. |
| Open Datasets | No | The paper describes a 'toy example' where data is generated synthetically ('x_i ∼ N(0, I_d), and y_i = θ^T x_i + η for some model with ‖θ‖_2 = 1 and independent noise η ∼ N(0, 1/4)'). It does not use or provide access information for a pre-existing publicly available dataset. |
| Dataset Splits | No | The paper describes a simulated 'toy example' for demonstration purposes and does not specify training, validation, or test dataset splits. |
| Hardware Specification | No | The paper describes theoretical results and a simulated 'toy example' but does not specify any hardware details (e.g., GPU/CPU models, memory) used for running experiments or simulations. |
| Software Dependencies | No | The paper does not list any specific software dependencies or their version numbers required to reproduce the experimental setup, focusing instead on theoretical contributions and a conceptual example. |
| Experiment Setup | Yes | Example 2: Given labeled examples (x_1, y_1), ..., (x_n, y_n) drawn from a distribution D, with x_i ∈ R^d and y_i ∈ R ... in the case where x_i ∼ N(0, I_d) and y_i = θ^T x_i + η for some model with ‖θ‖_2 = 1 and independent noise η ∼ N(0, 1/4) ... Here, the amplification procedure draws two additional datapoints x′ from the isotropic Gaussian with mean equal to the empirical mean, and labels them according to the learned least-squares regression model θ̂ with independent noise of variance 5/n times the empirical estimate of the unexplained variance. |
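The experiment-setup description above is concrete enough to sketch in code. The following NumPy snippet is a minimal illustration of the quoted procedure, not the authors' implementation: it generates the synthetic data from Example 2, computes the unbiased estimate of the unexplained variance via the n/(n − d)-scaled training error, and performs the described (n, n + 2) amplification. Variable names (`sigma2_hat`, `x_new`, etc.) are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 40, 20
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)          # ||theta||_2 = 1
noise_var = 0.25                        # true unexplained variance is 1/4

X = rng.standard_normal((n, d))         # x_i ~ N(0, I_d)
y = X @ theta + rng.normal(0.0, np.sqrt(noise_var), n)

# Least-squares fit; the mean training error scaled by n / (n - d)
# is the standard unbiased estimator of the unexplained variance.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
train_err = np.mean((X @ theta_hat - y) ** 2)
sigma2_hat = train_err * n / (n - d)

# (n, n+2) amplification as quoted above (a sketch of the procedure):
# draw two extra covariates from an isotropic Gaussian centred at the
# empirical mean, and label them with the fitted model plus independent
# noise of variance (5/n) * sigma2_hat.
x_new = rng.standard_normal((2, d)) + X.mean(axis=0)
y_new = x_new @ theta_hat + rng.normal(0.0, np.sqrt(5 / n * sigma2_hat), 2)

X_amp = np.vstack([X, x_new])
y_amp = np.concatenate([y, y_new])
```

Averaging `sigma2_hat` over many trials recovers a value near 1/4, matching the true unexplainable variance stated in the example.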