Truncated Linear Regression in High Dimensions

Authors: Constantinos Daskalakis, Dhruv Rohatgi, Emmanouil Zampetakis

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We prove that there exists a computationally and statistically efficient method for recovering k-sparse n-dimensional vectors x from m truncated samples, which attains an optimal ℓ2 reconstruction error of O(√((k log n)/m)). As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. Our proof consists of two parts. First, we show statistical recovery, i.e., we upper bound the number of samples needed for the solution of the truncated LASSO program to be close to the true coefficient vector x. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are two seminal approaches: the Primal-Dual Witness method for statistical recovery in high dimensions, and the Projected SGD method for efficient maximum likelihood estimation in the presence of truncation. (A hedged code sketch of this truncated-LASSO-plus-projected-SGD pipeline follows the table.)
Researcher Affiliation | Academia | Constantinos Daskalakis (MIT, costis@mit.edu); Dhruv Rohatgi (MIT, drohatgi@mit.edu); Manolis Zampetakis (MIT, mzampet@mit.edu)
Pseudocode | Yes | Section E contains a summary of the complete algorithm in pseudocode.
Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | No | The paper describes a synthetic data generation process for samples (A_i, y_i) rather than using an existing publicly available dataset; no access information for a public dataset is provided.
Dataset Splits | No | The paper does not provide training, validation, or test dataset splits. It states theoretical guarantees in terms of m samples but does not partition them for empirical evaluation.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers, needed to replicate the work.
Experiment Setup | No | The paper does not provide specific experimental setup details, such as concrete hyperparameter values or training configurations.
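
Since no open-source code accompanies the paper, the following is a minimal, hypothetical sketch of the pipeline the abstract describes: projected SGD on the negative log-likelihood of truncated linear regression, with iterates projected onto an ℓ1 ball (the "truncated LASSO" constraint). It assumes unit-variance Gaussian noise and a known truncation interval [lower, upper]; the function names, the rejection-sampling gradient oracle, and every hyperparameter below are illustrative choices, not the authors' exact algorithm from Section E.

```python
import numpy as np

def project_l1_ball(w, radius):
    """Euclidean projection onto the l1 ball of the given radius
    (the standard sort-and-threshold algorithm)."""
    if np.abs(w).sum() <= radius:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(w) + 1)
    rho = np.max(np.nonzero(u - (css - radius) / ks > 0)[0]) + 1
    theta = (css[rho - 1] - radius) / rho
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def truncated_mean(mu, lower, upper, n_draws=64, rng=None):
    """Rejection-sampling estimate of E[z | z in [lower, upper]] for
    z ~ N(mu, 1); this conditional expectation is what makes the gradient
    of the truncated log-likelihood nontrivial."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(mu, 1.0, size=n_draws)
    kept = z[(z >= lower) & (z <= upper)]
    # Crude fallback when no draw survives truncation (illustrative only).
    return kept.mean() if kept.size else float(np.clip(mu, lower, upper))

def truncated_lasso_sgd(A, y, lower, upper, l1_radius,
                        steps=20000, lr=0.01, seed=0):
    """Projected SGD on the truncated-regression negative log-likelihood,
    constrained to an l1 ball; returns the average iterate."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    w = np.zeros(n)
    avg = np.zeros(n)
    for _ in range(steps):
        i = rng.integers(m)
        mu = A[i] @ w
        # For unit-variance Gaussian noise truncated to S = [lower, upper],
        # the per-sample gradient is (E[z | z in S, z ~ N(mu,1)] - y_i) * A_i.
        g = (truncated_mean(mu, lower, upper, rng=rng) - y[i]) * A[i]
        w = project_l1_ball(w - lr * g, l1_radius)
        avg += w
    return avg / steps
```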
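
The Open Datasets row above notes that the paper works with synthetically generated truncated samples (A_i, y_i). Below is a hypothetical generation-and-recovery run in the same spirit, assuming a k-sparse ground truth, a Gaussian design, and one-sided truncation to S = [0, ∞); none of these concrete values come from the paper.

```python
rng = np.random.default_rng(1)
n, k, m = 200, 5, 1000
x_star = np.zeros(n)
x_star[:k] = 1.0                          # k-sparse ground truth
A_all = rng.normal(size=(4 * m, n))       # Gaussian design
y_all = A_all @ x_star + rng.normal(size=4 * m)
keep = y_all >= 0.0                       # only samples with y in S are observed
A, y = A_all[keep][:m], y_all[keep][:m]

x_hat = truncated_lasso_sgd(A, y, lower=0.0, upper=np.inf,
                            l1_radius=np.abs(x_star).sum())
print("l2 error:", np.linalg.norm(x_hat - x_star))
```

With m on the order of k log n samples, the abstract's guarantee predicts an ℓ2 reconstruction error scaling as O(√((k log n)/m)); the constants and step-size schedule needed to realize that rate come from the paper's analysis, not this sketch.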