Truncated Linear Regression in High Dimensions

Authors: Constantinos Daskalakis, Dhruv Rohatgi, Emmanouil Zampetakis

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We prove that there exists a computationally and statistically efficient method for recovering k-sparse n-dimensional vectors x from m truncated samples, which attains an optimal ℓ2 reconstruction error of O(√((k log n)/m)). As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. Our proof consists of two parts. First, we show statistical recovery, i.e., we upper bound the number of samples needed for the solution of the truncated LASSO program to be close to the true coefficient vector x. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are two seminal approaches: the Primal-Dual Witness method for statistical recovery in high dimensions, and the Projected SGD method for efficient maximum likelihood estimation in the presence of truncation. (A hedged code sketch of this truncated-LASSO-plus-projected-SGD pipeline follows the table.)
Researcher Affiliation | Academia | Constantinos Daskalakis (MIT, costis@mit.edu); Dhruv Rohatgi (MIT, drohatgi@mit.edu); Manolis Zampetakis (MIT, mzampet@mit.edu)
Pseudocode | Yes | Section E contains a summary of the complete algorithm in pseudocode.
Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | No | The paper describes a synthetic data generation process for samples (A_i, y_i) rather than using an existing publicly available dataset; no access information for a public dataset is provided.
Dataset Splits | No | The paper does not provide training, validation, or test dataset splits. It states theoretical guarantees in terms of m samples but does not partition them for empirical evaluation.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers, needed to replicate the work.
Experiment Setup | No | The paper does not provide specific experimental setup details, such as concrete hyperparameter values or training configurations.
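
Since no open-source code accompanies the paper, the following is a minimal, hypothetical sketch of the pipeline the abstract describes: projected SGD on the negative log-likelihood of truncated linear regression, with iterates projected onto an ℓ1 ball (the "truncated LASSO" constraint). It assumes unit-variance Gaussian noise and a known truncation interval [lower, upper]; the function names, the rejection-sampling gradient oracle, and every hyperparameter below are illustrative choices, not the authors' exact algorithm from Section E.

```python
import numpy as np

def project_l1_ball(w, radius):
    """Euclidean projection onto the l1 ball of the given radius
    (the standard sort-and-threshold algorithm)."""
    if np.abs(w).sum() <= radius:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(w) + 1)
    rho = np.max(np.nonzero(u - (css - radius) / ks > 0)[0]) + 1
    theta = (css[rho - 1] - radius) / rho
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def truncated_mean(mu, lower, upper, n_draws=64, rng=None):
    """Rejection-sampling estimate of E[z | z in [lower, upper]] for
    z ~ N(mu, 1); this conditional expectation is what makes the gradient
    of the truncated log-likelihood nontrivial."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(mu, 1.0, size=n_draws)
    kept = z[(z >= lower) & (z <= upper)]
    # Crude fallback when no draw survives truncation (illustrative only).
    return kept.mean() if kept.size else float(np.clip(mu, lower, upper))

def truncated_lasso_sgd(A, y, lower, upper, l1_radius,
                        steps=20000, lr=0.01, seed=0):
    """Projected SGD on the truncated-regression negative log-likelihood,
    constrained to an l1 ball; returns the average iterate."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    w = np.zeros(n)
    avg = np.zeros(n)
    for _ in range(steps):
        i = rng.integers(m)
        mu = A[i] @ w
        # For unit-variance Gaussian noise truncated to S = [lower, upper],
        # the per-sample gradient is (E[z | z in S, z ~ N(mu,1)] - y_i) * A_i.
        g = (truncated_mean(mu, lower, upper, rng=rng) - y[i]) * A[i]
        w = project_l1_ball(w - lr * g, l1_radius)
        avg += w
    return avg / steps
```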
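
The Open Datasets row above notes that the paper works with synthetically generated truncated samples (A_i, y_i). Below is a hypothetical generation-and-recovery run in the same spirit, assuming a k-sparse ground truth, a Gaussian design, and one-sided truncation to S = [0, ∞); none of these concrete values come from the paper.

```python
rng = np.random.default_rng(1)
n, k, m = 200, 5, 1000
x_star = np.zeros(n)
x_star[:k] = 1.0                          # k-sparse ground truth
A_all = rng.normal(size=(4 * m, n))       # Gaussian design
y_all = A_all @ x_star + rng.normal(size=4 * m)
keep = y_all >= 0.0                       # only samples with y in S are observed
A, y = A_all[keep][:m], y_all[keep][:m]

x_hat = truncated_lasso_sgd(A, y, lower=0.0, upper=np.inf,
                            l1_radius=np.abs(x_star).sum())
print("l2 error:", np.linalg.norm(x_hat - x_star))
```

With m on the order of k log n samples, the abstract's guarantee predicts an ℓ2 reconstruction error scaling as O(√((k log n)/m)); the constants and step-size schedule needed to realize that rate come from the paper's analysis, not this sketch.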