Efficient Truncated Linear Regression with Unknown Noise Variance
Authors: Constantinos Daskalakis, Patroklos Stefanou, Rui Yao, Emmanouil Zampetakis
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 1, we show how our algorithm performs against models with large noise variances. For this experiment, we sample a 2-dimensional ground-truth weight vector $w^* \overset{\text{i.i.d.}}{\sim} \mathcal{U}(-1, 1)$ and generate 1000 $(x^{(i)}, y^{(i)})$ pairs according to $x^{(i)} \overset{\text{i.i.d.}}{\sim} \mathcal{U}(-5, 5)$ and $y^{(i)} = w^{*T} x^{(i)} + \varepsilon^{(i)}$, where $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^{*2})$. We fix $w^*$ and $x^{(0)}, \ldots, x^{(n)}$ across all trials, but vary $\sigma^{*2}$. After adding Gaussian noise to our ground-truth predictions, we left truncate at zero, removing all of the pairs whose $y^{(i)}$ is negative and retaining approximately 50% of the original samples. We vary $\sigma^{*2}$ over the interval [1, 20] and evaluate how well our procedure recovers $w^*$ and $\sigma^{*2}$ in comparison to OLS; Daskalakis et al. (2019), assuming $\sigma_0^2$ is $\sigma^{*2}$; Daskalakis et al. (2019), given $\sigma^{*2}$; and truncreg, an R package for truncated regression. |
| Researcher Affiliation | Academia | Constantinos Daskalakis, EECS and CSAIL, MIT, costis@csail.mit.edu; Patroklos Stefanou, EECS and CSAIL, MIT, stefanou@mit.edu; Rui Yao, EECS and CSAIL, MIT, rayyao@mit.edu; Manolis Zampetakis, EECS, UC Berkeley, mzampet@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Projected SGD on ℓ |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology. |
| Open Datasets | Yes | For this experiment, we used the Aldrin (2004) dataset. The Aldrin (2004) dataset was originally collected by the Norwegian Public Roads Administration for a study of air pollution at a road in Oslo, Norway. The dataset consists of 500 observations. Interestingly enough, it is common for environmental data to be truncated because of problems in reliably measuring low concentrations. Aldrin, M. Pm10 dataset, 2004. URL http://lib.stat.cmu.edu/datasets/PM10.dat. |
| Dataset Splits | No | The paper mentions splitting samples into three parts for different algorithm steps (OLS estimates, $\sigma_0$ estimation, PSGD algorithm) but does not provide explicit train/validation/test splits with percentages or sample counts for evaluation purposes. For the synthetic experiments, it states 'we randomly select a k samples', which is not a predefined train/validation split. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | Yes | We vary $\sigma^{*2}$ over the interval [1, 20] and evaluate how well our procedure recovers $w^*$ and $\sigma^{*2}$ in comparison to OLS; Daskalakis et al. (2019), assuming $\sigma_0^2$ is $\sigma^{*2}$; Daskalakis et al. (2019), given $\sigma^{*2}$; and truncreg, an R package for truncated regression. truncreg: Estimation of models for truncated gaussian variables by maximum likelihood, 2018. URL https://cran.r-project.org/web/packages/truncreg/truncreg.pdf. R package version 1.8.0. |
| Experiment Setup | No | The paper describes the ranges of its experimental parameters: the noise variance $\sigma^{*2}$ (from 1 to 20), the number of samples k (from 10 to 5000), and the truncation parameter C (from 1.0 to 4.0). It notes that the experiments used 'the same hyperparameters as for the synthetic data experiments.' It also states 'We run our procedure a total of 10 times and retain the trial that has the smallest gradient as our algorithm's prediction.' However, it does not provide concrete hyperparameter values for the PSGD algorithm itself, such as learning rates, batch sizes, or number of training epochs. |
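
The synthetic setup quoted in the Research Type row can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names `make_truncated_data` and `ols`, the seed handling, and the choice of an intercept-free model are hypothetical, and only the OLS baseline is shown (the paper's projected SGD procedure is not reproduced here). Reusing one seed keeps the weight vector and the covariates fixed while the noise variance changes, mirroring the description that they are fixed across trials.

```python
import numpy as np


def make_truncated_data(sigma2, n=1000, d=2, seed=0):
    """Draw synthetic (x, y) pairs and left-truncate them at zero.

    Assumption: w* and x are drawn first from a seeded generator, so the
    same seed gives the same w* and x for every value of sigma2.
    """
    rng = np.random.default_rng(seed)
    w_star = rng.uniform(-1.0, 1.0, size=d)         # ground-truth weights
    x = rng.uniform(-5.0, 5.0, size=(n, d))         # covariates
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)  # Gaussian noise
    y = x @ w_star + eps
    keep = y >= 0.0                                 # drop pairs with negative y
    return w_star, x[keep], y[keep]


def ols(x, y):
    """Ordinary least squares without an intercept, plus a residual variance estimate."""
    w_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ w_hat
    sigma2_hat = resid @ resid / max(len(y) - x.shape[1], 1)
    return w_hat, sigma2_hat


if __name__ == "__main__":
    for sigma2 in (1.0, 5.0, 20.0):
        w_star, x_t, y_t = make_truncated_data(sigma2)
        w_hat, s2_hat = ols(x_t, y_t)
        print(
            f"sigma2={sigma2:5.1f}  kept={len(y_t):4d}  "
            f"w*={np.round(w_star, 3)}  w_ols={np.round(w_hat, 3)}  "
            f"sigma2_ols={s2_hat:.3f}"
        )
```

Running the script prints, for each noise variance, how many samples survive truncation and how far the naive OLS estimates sit from the ground truth, which is the gap the compared truncated-regression methods aim to close.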