Truncated Linear Regression in High Dimensions
Authors: Constantinos Daskalakis, Dhruv Rohatgi, Emmanouil Zampetakis
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We prove that there exists a computationally and statistically efficient method for recovering k-sparse n-dimensional vectors x* from m truncated samples, which attains an optimal ℓ2 reconstruction error of O(√((k log n)/m)). As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. Our proof consists of two parts. First, we show statistical recovery, i.e., we upper bound the number of samples needed for the solution of the truncated LASSO program to be close to the true coefficient vector x*. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are two seminal approaches: the Primal-Dual Witness method for statistical recovery in high dimensions, and the Projected SGD method for efficient maximum likelihood estimation in the presence of truncation. (An illustrative code sketch of this projected-SGD approach appears after the table.) |
| Researcher Affiliation | Academia | Constantinos Daskalakis MIT costis@mit.edu Dhruv Rohatgi MIT drohatgi@mit.edu Manolis Zampetakis MIT mzampet@mit.edu |
| Pseudocode | Yes | Section E contains a summary of the complete algorithm in pseudocode. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper describes a synthetic data generation process for samples (Ai, yi) rather than using an existing publicly available dataset. No access information for a public dataset is provided. |
| Dataset Splits | No | The paper does not provide specific training, validation, or test dataset splits. It describes theoretical guarantees for 'm samples' but not their partitioning for empirical evaluation. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers, needed to replicate the work. |
| Experiment Setup | No | The paper does not provide specific experimental setup details, such as concrete hyperparameter values or training configurations. |
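
The Research Type response above describes the core algorithmic idea: run projected SGD on a LASSO objective adapted to truncation. Below is a minimal, hedged Python sketch of that idea, not the authors' implementation. It assumes unit-variance Gaussian noise, a survival set S available through a membership oracle `in_survival_set`, rejection sampling to estimate the conditional mean of a truncated Gaussian, and an ℓ2-ball projection standing in for the paper's projection set; every function name and hyperparameter here (`lam`, `lr`, `radius`, `n_draws`) is an illustrative choice, not taken from the paper.

```python
import numpy as np

def rejection_sample_mean(mu, in_survival_set, n_draws=64, rng=None):
    """Estimate E[z | z ~ N(mu, 1), z in S] by rejection sampling.

    `in_survival_set` is a vectorized membership oracle for the
    (assumed known) truncation set S. If no draw lands in S we fall
    back to mu itself; a real implementation would resample instead.
    """
    rng = rng or np.random.default_rng()
    z = rng.normal(mu, 1.0, size=n_draws)
    kept = z[in_survival_set(z)]
    return kept.mean() if kept.size else mu

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1 (the LASSO shrinkage step)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def truncated_lasso_sgd(A, y, in_survival_set, lam=0.1, lr=0.01,
                        n_epochs=50, radius=10.0, seed=0):
    """Stochastic proximal-gradient sketch of the truncated LASSO.

    For unit-variance noise, the per-sample gradient of the truncated
    negative log-likelihood at w is
        (E[z | z ~ N(a.w, 1), z in S] - y) * a,
    i.e. the usual residual (a.w - y) is replaced by a "truncated
    residual". Soft thresholding handles the l1 penalty, and an
    l2-ball projection stands in for the paper's projection set.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            mu = A[i] @ w
            cond_mean = rejection_sample_mean(mu, in_survival_set, rng=rng)
            grad = (cond_mean - y[i]) * A[i]        # truncated-residual gradient
            w = soft_threshold(w - lr * grad, lr * lam)
            norm = np.linalg.norm(w)
            if norm > radius:                       # projection step
                w *= radius / norm
    return w
```

A toy usage example, again with hypothetical parameters (samples are retained only when y > 0, a simple one-sided truncation):

```python
rng = np.random.default_rng(1)
n, k, m = 200, 5, 2000
x_star = np.zeros(n); x_star[:k] = 1.0              # k-sparse ground truth
A_all = rng.normal(size=(4 * m, n))
y_all = A_all @ x_star + rng.normal(size=4 * m)
keep = y_all > 0                                    # truncation to S = (0, inf)
A, y = A_all[keep][:m], y_all[keep][:m]
w_hat = truncated_lasso_sgd(A, y, in_survival_set=lambda z: z > 0)
```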