Unbiased estimates for linear regression via volume sampling

Authors: Michał Dereziński, Manfred K. Warmuth

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Given a full rank matrix X with more columns than rows, consider the task of estimating the pseudo inverse X⁺ based on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (i.e., volume sampling). The resulting estimator is unbiased and, surprisingly, the covariance of the estimator also has a closed form: it equals a specific factor times X⁺ᵀX⁺. The pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of X. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from X. Using our methods we show that the weight vector of the solution for the subproblem is an unbiased estimator of the optimal solution for the whole problem based on all column labels. We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than the state of the art, and to obtain bounds for the total loss of the estimated least-squares solution on all labeled columns. (A least-squares sketch of this setup follows the table.)
Researcher Affiliation | Academia | Michał Dereziński, Department of Computer Science, University of California Santa Cruz, mderezin@ucsc.edu; Manfred K. Warmuth, Department of Computer Science, University of California Santa Cruz, manfred@ucsc.edu
Pseudocode | Yes | Reverse iterative volume sampling. Input: X ∈ R^{d×n}, s ∈ {d, .., n}. Z ← (X Xᵀ)⁻¹; for all i ∈ {1, .., n}: p_i ← 1 − x_iᵀ Z x_i; S ← {1, .., n}; while |S| > s: sample i ∝ p_i out of S; S ← S \ {i}; v ← Z x_i / √p_i; for all j ∈ S: p_j ← p_j − (x_jᵀ v)²; Z ← Z + v vᵀ; end; return S. (A NumPy sketch of this procedure follows the table.)
Open Source Code | No | The paper does not provide any link or explicit statement about the availability of source code for the methodology described.
Open Datasets | No | The paper is theoretical and does not use or refer to specific datasets, nor does it provide concrete access information for any dataset.
Dataset Splits | No | The paper is theoretical and does not describe any dataset splits for training, validation, or testing.
Hardware Specification | No | The paper describes theoretical work and does not mention any specific hardware used for running experiments.
Software Dependencies | No | The paper is theoretical and does not mention any software dependencies with specific version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any experimental setup details such as hyperparameters or system-level training settings.
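
To make the least-squares setup quoted in the Research Type row concrete, here is a minimal NumPy sketch of the full-problem and sub-problem solutions. The function names are illustrative and not from the paper; the paper's claim is that the sub-problem solution is an unbiased estimator of the full solution when the column subset is volume sampled.

import numpy as np

def full_least_squares(X, y):
    """Optimal weight vector w* minimizing sum_i (w^T x_i - y_i)^2,
    where the x_i are the columns of the d x n matrix X."""
    # For full-rank X, w* = (X^T)^+ y = (X X^T)^{-1} X y.
    return np.linalg.pinv(X.T) @ y

def subset_least_squares(X, y, S):
    """Least-squares solution computed only from the labeled columns in S.
    The paper shows E[w(S)] = w* when S is drawn by volume sampling."""
    S = list(S)
    return np.linalg.pinv(X[:, S].T) @ y[S]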
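
The pseudocode row above can also be turned into runnable code. Below is a minimal NumPy sketch of reverse iterative volume sampling under my reading of that pseudocode; the function name and the use of NumPy's random generator are my own choices, not from the paper.

import numpy as np

def reverse_iterative_volume_sampling(X, s, rng=None):
    """Sample a subset S of s column indices of the d x n full-rank matrix X
    (with d <= s <= n) proportionally to the squared volume spanned by the
    rows of the submatrix X_S, removing one column at a time."""
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    Z = np.linalg.inv(X @ X.T)                      # Z <- (X X^T)^{-1}
    p = 1.0 - np.einsum('ji,jk,ki->i', X, Z, X)     # p_i <- 1 - x_i^T Z x_i
    S = list(range(n))
    while len(S) > s:
        idx = np.array(S)
        i = rng.choice(idx, p=p[idx] / p[idx].sum())  # sample i proportional to p_i
        S.remove(i)                                 # S <- S \ {i}
        v = Z @ X[:, i] / np.sqrt(p[i])             # v <- Z x_i / sqrt(p_i)
        rest = np.array(S)
        p[rest] -= (X[:, rest].T @ v) ** 2          # p_j <- p_j - (x_j^T v)^2 for j in S
        Z += np.outer(v, v)                         # Z <- Z + v v^T
    return S

As a usage note, one draw of the estimator discussed in the abstract would be S = reverse_iterative_volume_sampling(X, s) followed by subset_least_squares(X, y, S), using the sketch above.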