Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Towards a statistical theory of data selection under weak supervision
Authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) Data selection can be very effective, in particular beating training on the full sample in some cases; (ii) Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal. |
| Researcher Affiliation | Industry | Germain Kolossov, Andrea Montanari, Pulkit Tandon Granica Computing, Inc. EMAIL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The code to reproduce theory and simulations with synthetic data can be found at the github repo. |
| Open Datasets | Yes | We use a subset of images obtained from the KITTI-360 train set (Liao et al., 2022). |
| Dataset Splits | Yes | We randomly partition this dataset into four disjoint sets: Ntrain = 34,345 images to perform subsampling and train models, Nsurr = 14,720 images to train surrogate models, Nval = 3665 images for validation and Ntest = 8550 images for reporting the final experiment results. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions "scikit-learn library (Pedregosa et al., 2011)" and "torch.hub.load('facebookresearch/swav:main', 'resnet50')" but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch). |
| Experiment Setup | Yes | The training utilized the L-BFGS optimization algorithm with a cap of 10,000 iterations implemented using the scikit-learn library... The ridge regularization parameter $\lambda$. This is either fixed or selected optimally by taking $\lambda = \arg\min_{\lambda \in \Lambda} R_{\text{val}}(\hat{\theta}_{\lambda})$, where $R_{\text{val}}$ is the risk on the validation set and $\Lambda := \{0.001, 0.01, 0.03, 0.06, 0.1, 1, 10\}$. |
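
The random four-way partition quoted in the Dataset Splits row can be illustrated with a minimal sketch. Only the split sizes come from the quoted text; the function name `partition_indices`, the split labels, and the seed are hypothetical, and the paper's actual partitioning code may differ.

```python
import random

# Split sizes as quoted in the Dataset Splits row.
SPLIT_SIZES = {"train": 34_345, "surr": 14_720, "val": 3_665, "test": 8_550}


def partition_indices(n_items, sizes, seed=0):
    """Randomly partition range(n_items) into disjoint, named index lists.

    The seed is an illustrative assumption; the source does not report one.
    """
    if sum(sizes.values()) != n_items:
        raise ValueError("split sizes must sum to the dataset size")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # one shuffle, then contiguous slices
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits


# Indices stand in for images; the sizes sum to the full subset size.
splits = partition_indices(sum(SPLIT_SIZES.values()), SPLIT_SIZES)
```

Shuffling once and slicing guarantees that the four sets are disjoint and jointly cover every index.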
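
The validation-based choice of the ridge parameter quoted in the Experiment Setup row amounts to a grid search: fit one model per value in Λ and keep the value with the lowest validation risk. The sketch below shows that selection loop only. As a loudly labeled assumption, it swaps in a toy one-dimensional ridge regression with a closed-form solution in place of the paper's L-BFGS-trained scikit-learn models, and all function names and data are hypothetical.

```python
# Grid Λ as quoted in the Experiment Setup row.
LAMBDA_GRID = [0.001, 0.01, 0.03, 0.06, 0.1, 1, 10]


def fit_ridge_1d(xs, ys, lam):
    """Toy stand-in model: minimizer of (1/n) * sum((y - theta*x)^2) + lam * theta^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


def validation_risk(theta, xs, ys):
    """Mean squared error of the fitted model on the validation set."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_lambda(grid, train, val):
    """Return (lambda, theta, risk) for the grid value with the lowest validation risk."""
    best = None
    for lam in grid:
        theta = fit_ridge_1d(train[0], train[1], lam)
        risk = validation_risk(theta, val[0], val[1])
        if best is None or risk < best[2]:
            best = (lam, theta, risk)
    return best


# Hypothetical toy data: noisy samples of y = 2x for training, clean ones for validation.
train = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
val = ([4.0, 5.0], [8.0, 10.0])
best_lam, best_theta, best_risk = select_lambda(LAMBDA_GRID, train, val)
```

With scikit-learn estimators the same loop would apply, but note that `LogisticRegression` is parameterized by `C`, the inverse of the regularization strength, so the mapping from λ to `C` would have to be fixed explicitly.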