Towards a statistical theory of data selection under weak supervision

Authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon

ICLR 2024

Reproducibility assessment (Variable / Result / LLM Response):

Research Type: Experimental
LLM Response: By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) Data selection can be very effective, in particular beating training on the full sample in some cases; (ii) Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.

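To make the first flagged baseline concrete, the sketch below shows unbiased reweighted subsampling in the Horvitz-Thompson style: example i is kept with probability pi[i] and, if kept, carries weight 1/pi[i], so the weighted subsample loss is an unbiased estimate of the full-sample loss. This is a minimal illustration of the general technique, not the authors' code; the function name and NumPy implementation are ours.

```python
import numpy as np

def unbiased_reweighted_subsample(X, y, pi, seed=None):
    """Keep example i with probability pi[i]; weight survivors by 1/pi[i].

    The 1/pi weights make the weighted empirical loss an unbiased
    (Horvitz-Thompson) estimate of the full-sample loss.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(len(y)) < pi
    return X[keep], y[keep], 1.0 / pi[keep]
```

The returned weights can be passed to an estimator's sample_weight argument (e.g., in scikit-learn) to train on the subsample.
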
Researcher Affiliation: Industry
LLM Response: Germain Kolossov, Andrea Montanari, Pulkit Tandon. Granica Computing, Inc. {germain.kolossov,andrea.montanari,pulkit.tandon}@granica.ai

Pseudocode: No
LLM Response: The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm".

Open Source Code: Yes
LLM Response: The code to reproduce the theory and simulations with synthetic data can be found at the GitHub repo.

Open Datasets: Yes
LLM Response: We use a subset of images obtained from the KITTI-360 train set (Liao et al., 2022).

Dataset Splits: Yes
LLM Response: We randomly partition this dataset into four disjoint sets: N_train = 34,345 images to perform subsampling and train models, N_surr = 14,720 images to train surrogate models, N_val = 3,665 images for validation, and N_test = 8,550 images for reporting the final experiment results.

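A minimal sketch of such a four-way random partition, using the split sizes quoted above; the seed and the NumPy implementation are our assumptions, not taken from the paper:

```python
import numpy as np

# Split sizes quoted from the paper; total is 61,280 images.
sizes = {"train": 34345, "surr": 14720, "val": 3665, "test": 8550}

rng = np.random.default_rng(0)  # seed is an assumption, not from the paper
perm = rng.permutation(sum(sizes.values()))

splits, start = {}, 0
for name, n in sizes.items():
    splits[name] = perm[start:start + n]  # disjoint index sets
    start += n
```
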
Hardware Specification: No
LLM Response: The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments.

Software Dependencies: No
LLM Response: The paper mentions the scikit-learn library (Pedregosa et al., 2011) and torch.hub.load('facebookresearch/swav:main', 'resnet50'), but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch).

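For reference, the quoted hub call (with its stripped quote marks restored) loads the self-supervised SwAV ResNet-50 checkpoint from PyTorch Hub. The eval() call and the comments are our additions; no package versions appear here because the paper pins none:

```python
import torch

# Load the SwAV self-supervised ResNet-50 used as a feature extractor.
model = torch.hub.load('facebookresearch/swav:main', 'resnet50')
model.eval()  # inference mode for feature extraction
```
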
Experiment Setup: Yes
LLM Response: The training utilized the L-BFGS optimization algorithm with a cap of 10,000 iterations, implemented using the scikit-learn library... The ridge regularization parameter λ is either fixed or selected optimally by taking λ = arg min_{λ ∈ Λ} R_val(θ̂_λ), where R_val is the risk on the validation set and Λ := {0.001, 0.01, 0.03, 0.06, 0.1, 1, 10}.

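A minimal sketch of the quoted setup, assuming scikit-learn's LogisticRegression as the ridge-regularized classifier: solver='lbfgs' and max_iter=10000 match the quote, but the mapping from the paper's λ to scikit-learn's C and the use of misclassification error as the validation risk R_val are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LAMBDAS = [0.001, 0.01, 0.03, 0.06, 0.1, 1, 10]  # the grid Λ from the paper

def fit_ridge_logistic(X, y, lam):
    # Assumed mapping: scikit-learn's C = 1 / (lam * n) for ridge strength lam.
    clf = LogisticRegression(penalty="l2", C=1.0 / (lam * len(y)),
                             solver="lbfgs", max_iter=10_000)
    return clf.fit(X, y)

def select_lambda(X_tr, y_tr, X_val, y_val):
    # lambda = argmin over the grid of the validation risk
    # (approximated here by misclassification error).
    risks = [1.0 - fit_ridge_logistic(X_tr, y_tr, lam).score(X_val, y_val)
             for lam in LAMBDAS]
    return LAMBDAS[int(np.argmin(risks))]
```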