Towards a statistical theory of data selection under weak supervision
Authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By using a mixture of numerical experiments on real and synthetic data, and mathematical derivations under low- and high-dimensional asymptotics, we show that: (i) Data selection can be very effective, in particular beating training on the full sample in some cases; (ii) Certain popular choices in data selection methods (e.g. unbiased reweighted subsampling, or influence function-based subsampling) can be substantially suboptimal. (A sketch of reweighted subsampling is given after the table.) |
| Researcher Affiliation | Industry | Germain Kolossov, Andrea Montanari, Pulkit Tandon Granica Computing, Inc. {germain.kolossov,andrea.montanari,pulkit.tandon}@granica.ai |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The code to reproduce theory and simulations with synthetic data can be found at the GitHub repo. |
| Open Datasets | Yes | We use a subset of images obtained from the KITTI-360 train set (Liao et al., 2022). |
| Dataset Splits | Yes | We randomly partition this dataset into four disjoint sets: Ntrain = 34,345 images to perform subsampling and train models, Nsurr = 14,720 images to train surrogate models, Nval = 3,665 images for validation and Ntest = 8,550 images for reporting the final experiment results. (A schematic split is sketched after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions "scikit-learn library (Pedregosa et al., 2011)" and `torch.hub.load('facebookresearch/swav:main', 'resnet50')` but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch). (Loading the SwAV model is sketched after the table.) |
| Experiment Setup | Yes | The training utilized the L-BFGS optimization algorithm with a cap of 10,000 iterations, implemented using the scikit-learn library... The ridge regularization parameter λ is either fixed or selected optimally by taking λ = arg min_{λ∈Λ} R_val(θ̂_λ), where R_val is the risk on the validation set and Λ := {0.001, 0.01, 0.03, 0.06, 0.1, 1, 10}. (A sketch of this selection loop is given below.) |
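The research-type row mentions unbiased reweighted subsampling. As a minimal sketch of that generic technique (not the authors' code; the function name and `probs` argument are hypothetical), each example i is kept with probability probs[i] and, if kept, its loss is reweighted by 1/probs[i], so the weighted subsample loss is an unbiased estimate of the full-sample loss:

```python
import numpy as np

def reweighted_subsample(X, y, probs, seed=0):
    """Unbiased reweighted subsampling (generic sketch).

    Example i is kept independently with probability probs[i]; kept examples
    receive weight 1/probs[i], making the expected weighted loss of the
    subsample equal to the full-sample loss.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(len(y)) < probs
    return X[keep], y[keep], 1.0 / probs[keep]

# The returned weights can be passed to scikit-learn estimators via
# fit(..., sample_weight=w).
```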
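The four-way KITTI-360 split quoted in the dataset-splits row can be reproduced schematically as follows; the random seed and the use of a single permutation are assumptions, not details from the paper:

```python
import numpy as np

# Split sizes as reported in the paper (total: 61,280 images).
sizes = {"train": 34_345, "surr": 14_720, "val": 3_665, "test": 8_550}
N = sum(sizes.values())

rng = np.random.default_rng(0)  # seed is an assumption
perm = rng.permutation(N)

# Carve the permuted indices into four disjoint sets.
splits, start = {}, 0
for name, n in sizes.items():
    splits[name] = perm[start:start + n]
    start += n
```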
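The `torch.hub.load` call quoted in the software-dependencies row loads a SwAV-pretrained ResNet-50. A minimal sketch of using it as a feature extractor (the batch below is a placeholder; real images need standard resizing and normalization, which the paper does not spell out):

```python
import torch

# Load the SwAV-pretrained ResNet-50, as quoted in the paper.
model = torch.hub.load('facebookresearch/swav:main', 'resnet50')
model.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # placeholder input batch
    feats = model(batch)                 # per-image representations
```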
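A sketch of the experiment setup described in the last row, assuming binary labels and the usual correspondence C = 1/(nλ) between a ridge penalty λ on the mean loss and scikit-learn's `C` parameter (that mapping is an assumption, not stated in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

LAMBDA_GRID = [0.001, 0.01, 0.03, 0.06, 0.1, 1, 10]  # Λ from the paper

def fit_select_lambda(X_tr, y_tr, X_val, y_val):
    """Ridge-regularized logistic regression trained with L-BFGS
    (capped at 10,000 iterations), with λ chosen by validation risk."""
    best_clf, best_risk = None, np.inf
    for lam in LAMBDA_GRID:
        clf = LogisticRegression(solver='lbfgs', max_iter=10_000,
                                 C=1.0 / (lam * len(y_tr)))
        clf.fit(X_tr, y_tr)
        risk = log_loss(y_val, clf.predict_proba(X_val))  # R_val(θ̂_λ)
        if risk < best_risk:
            best_clf, best_risk = clf, risk
    return best_clf
```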