reproducibilityindex.ai

Efficient Truncated Linear Regression with Unknown Noise Variance

Authors: Constantinos Daskalakis, Patroklos Stefanou, Rui Yao, Emmanouil Zampetakis

NeurIPS 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In Figure 1, we show how our algorithm performs against models with large noise variances. For this experiment, we sample a 2-dimensional ground-truth weight vector 𝑤* i.i.d. ∼ 𝒰(−1, 1) and generate 1000 (𝑥⁽ⁱ⁾, 𝑦⁽ⁱ⁾) pairs according to 𝑥⁽ⁱ⁾ i.i.d. ∼ 𝒰(−5, 5) and 𝑦⁽ⁱ⁾ = 𝑤*ᵀ𝑥⁽ⁱ⁾ + 𝜀⁽ⁱ⁾, where 𝜀⁽ⁱ⁾ ∼ 𝒩(0, 𝜎²). We fix 𝑤* and 𝑥⁽⁰⁾, ..., 𝑥⁽ⁿ⁾ across all trials, but vary 𝜎². After adding Gaussian noise to our ground-truth predictions, we left truncate at zero, removing all of the pairs whose 𝑦⁽ⁱ⁾ is negative, retaining approximately 50% of the original samples. We vary 𝜎² over the interval [1, 20] and evaluate how well our procedure recovers 𝑤* and 𝜎² in comparison to OLS, Daskalakis et al. (2019), assuming 𝜎₀² is 𝜎², Daskalakis et al. (2019), given 𝜎*², and truncreg, an R package for truncated regression.
Researcher Affiliation	Academia	Constantinos Daskalakis (EECS and CSAIL, MIT; costis@csail.mit.edu); Patroklos Stefanou (EECS and CSAIL, MIT; stefanou@mit.edu); Rui Yao (EECS and CSAIL, MIT; rayyao@mit.edu); Manolis Zampetakis (EECS, UC Berkeley; mzampet@berkeley.edu)
Pseudocode	Yes	Algorithm 1 Projected SGD on ℓ
Open Source Code	No	The paper does not provide an explicit statement or link for open-source code for the described methodology.
Open Datasets	Yes	For this experiment, we used the Aldrin (2004) dataset. The Aldrin (2004) dataset was originally collected by the Norwegian Public Roads Administration for a study of air pollution at a road in Oslo, Norway. The dataset consists of 500 observations. Interestingly enough, it is common for environmental data to be truncated because of problems in reliably measuring low concentrations. Aldrin, M. Pm10 dataset, 2004. URL http://lib.stat.cmu.edu/datasets/PM10.dat.
Dataset Splits	No	The paper mentions splitting samples into three parts for different algorithm steps (OLS estimates, 𝜎₀ estimation, PSGD algorithm) but does not provide explicit train/validation/test splits with percentages or sample counts for evaluation purposes. For the synthetic experiments, it states 'we randomly select a k samples', which is not a predefined train/validation split.
Hardware Specification	No	The paper does not provide any specific hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies	Yes	We vary 𝜎² over the interval [1, 20] and evaluate how well our procedure recovers 𝑤 and 𝜎² in comparison to OLS, Daskalakis et al. (2019), assuming 𝜎₀² is 𝜎², Daskalakis et al. (2019), given 𝜎*², and truncreg, an R package for truncated regression. truncreg: Estimation of models for truncated gaussian variables by maximum likelihood, 2018. URL https://cran.r-project.org/web/packages/truncreg/truncreg.pdf. R package version 1.8.0.
Experiment Setup	No	The paper describes variations in experimental parameters like 𝜎*² (from 1 to 20), the number of samples k (from 10 to 5000), and the truncation parameter C (from 1.0 to 4.0), and notes that the experiments used 'the same hyperparameters as for the synthetic data experiments.' It also states 'We run our procedure a total of 10 times and retain the trial that has the smallest gradient as our algorithm's prediction.' However, it does not explicitly provide concrete hyperparameter values for the PSGD algorithm itself, such as specific learning rates, batch sizes, or number of training epochs.
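
The synthetic setup quoted in the Research Type row can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code (none is released): the function names `make_truncated_data` and `ols`, the seed handling, and the absence of an intercept term are assumptions. It draws 𝑤* from 𝒰(−1, 1) and 𝑥 from 𝒰(−5, 5), adds Gaussian noise with variance 𝜎², left truncates at zero, and fits OLS on the surviving pairs as the naive baseline.

```python
import numpy as np

def make_truncated_data(sigma2, n=1000, d=2, seed=0):
    """Hypothetical helper: generate (x, y) pairs, then left-truncate at zero."""
    rng = np.random.default_rng(seed)
    # w* and x are drawn before the noise, so a fixed seed keeps them
    # identical across different values of sigma2.
    w_star = rng.uniform(-1.0, 1.0, size=d)
    x = rng.uniform(-5.0, 5.0, size=(n, d))
    noise = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = x @ w_star + noise          # assumption: no intercept term
    keep = y >= 0.0                 # remove pairs whose y is negative
    return w_star, x[keep], y[keep]

def ols(x, y):
    """Ordinary least squares without an intercept (baseline only)."""
    w_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w_hat

w_star, x_t, y_t = make_truncated_data(sigma2=10.0)
w_ols = ols(x_t, y_t)
print("retained fraction:", len(y_t) / 1000)
print("OLS parameter error:", np.linalg.norm(w_ols - w_star))
```

Sweeping `sigma2` over [1, 20] with the same seed mirrors the quoted protocol of fixing 𝑤* and the covariates while varying only the noise variance. The truncation-aware estimators compared in the paper (the PSGD procedure, Daskalakis et al. (2019), and truncreg) are not reproduced here.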