Towards a statistical theory of data selection under weak supervision

Authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon

ICLR 2024

Reproducibility assessment (Variable / Result / LLM Response):

Research Type: Experimental
LLM Response: By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) Data selection can be very effective, in particular beating training on the full sample in some cases; (ii) Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal.

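To make the first flagged baseline concrete, the sketch below shows unbiased reweighted subsampling in the Horvitz-Thompson style: example i is kept with probability pi[i] and, if kept, carries weight 1/pi[i], so the weighted subsample loss is an unbiased estimate of the full-sample loss. This is a minimal illustration of the general technique, not the authors' code; the function name and NumPy implementation are ours.

```python
import numpy as np

def unbiased_reweighted_subsample(X, y, pi, seed=None):
    """Keep example i with probability pi[i]; weight survivors by 1/pi[i].

    The 1/pi weights make the weighted empirical loss an unbiased
    (Horvitz-Thompson) estimate of the full-sample loss.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(len(y)) < pi
    return X[keep], y[keep], 1.0 / pi[keep]
```

The returned weights can be passed to an estimator's sample_weight argument (e.g., in scikit-learn) to train on the subsample.
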
Researcher Affiliation: Industry
LLM Response: Germain Kolossov, Andrea Montanari, Pulkit Tandon. Granica Computing, Inc. {germain.kolossov,andrea.montanari,pulkit.tandon}@granica.ai

Pseudocode: No
LLM Response: The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm".

Open Source Code: Yes
LLM Response: The code to reproduce the theory and simulations with synthetic data can be found at the GitHub repo.

Open Datasets: Yes
LLM Response: We use a subset of images obtained from the KITTI-360 train set (Liao et al., 2022).

Dataset Splits: Yes
LLM Response: We randomly partition this dataset into four disjoint sets: N_train = 34,345 images to perform subsampling and train models, N_surr = 14,720 images to train surrogate models, N_val = 3,665 images for validation, and N_test = 8,550 images for reporting the final experiment results.

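A minimal sketch of such a four-way random partition, using the split sizes quoted above; the seed and the NumPy implementation are our assumptions, not taken from the paper:

```python
import numpy as np

# Split sizes quoted from the paper; total is 61,280 images.
sizes = {"train": 34345, "surr": 14720, "val": 3665, "test": 8550}

rng = np.random.default_rng(0)  # seed is an assumption, not from the paper
perm = rng.permutation(sum(sizes.values()))

splits, start = {}, 0
for name, n in sizes.items():
    splits[name] = perm[start:start + n]  # disjoint index sets
    start += n
```
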
Hardware Specification: No
LLM Response: The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments.

Software Dependencies: No
LLM Response: The paper mentions the scikit-learn library (Pedregosa et al., 2011) and torch.hub.load('facebookresearch/swav:main', 'resnet50'), but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch).

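For reference, the quoted hub call (with its stripped quote marks restored) loads the self-supervised SwAV ResNet-50 checkpoint from PyTorch Hub. The eval() call and the comments are our additions; no package versions appear here because the paper pins none:

```python
import torch

# Load the SwAV self-supervised ResNet-50 used as a feature extractor.
model = torch.hub.load('facebookresearch/swav:main', 'resnet50')
model.eval()  # inference mode for feature extraction
```
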
Experiment Setup: Yes
LLM Response: The training utilized the L-BFGS optimization algorithm with a cap of 10,000 iterations, implemented using the scikit-learn library... The ridge regularization parameter λ is either fixed or selected optimally by taking λ = arg min_{λ ∈ Λ} R_val(θ̂_λ), where R_val is the risk on the validation set and Λ := {0.001, 0.01, 0.03, 0.06, 0.1, 1, 10}.

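A minimal sketch of the quoted setup, assuming scikit-learn's LogisticRegression as the ridge-regularized classifier: solver='lbfgs' and max_iter=10000 match the quote, but the mapping from the paper's λ to scikit-learn's C and the use of misclassification error as the validation risk R_val are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LAMBDAS = [0.001, 0.01, 0.03, 0.06, 0.1, 1, 10]  # the grid Λ from the paper

def fit_ridge_logistic(X, y, lam):
    # Assumed mapping: scikit-learn's C = 1 / (lam * n) for ridge strength lam.
    clf = LogisticRegression(penalty="l2", C=1.0 / (lam * len(y)),
                             solver="lbfgs", max_iter=10_000)
    return clf.fit(X, y)

def select_lambda(X_tr, y_tr, X_val, y_val):
    # lambda = argmin over the grid of the validation risk
    # (approximated here by misclassification error).
    risks = [1.0 - fit_ridge_logistic(X_tr, y_tr, lam).score(X_val, y_val)
             for lam in LAMBDAS]
    return LAMBDAS[int(np.argmin(risks))]
```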