Radioactive data: tracing through training
Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Herve Jegou
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on large-scale benchmarks (Imagenet), with standard architectures (Resnet-18, VGG-16, Densenet-121) and training procedures, show that we detect radioactive data with high confidence (p < 0.0001) when only 1% of the data used to train a model is radioactive. |
| Researcher Affiliation | Collaboration | ¹Facebook AI Research, Paris; ²Inria, Grenoble. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We employ the widely-used benchmarks Imagenet (Deng et al., 2009), a dataset of natural images with 1.2M images belonging to 1,000 classes and Places205 (Zhou et al., 2014), a dataset of 2.4M images from 205 scene categories. |
| Dataset Splits | No | While a 'validation set' is mentioned (Section 3.4: 'In practice, we use vanilla images of a held-out set (the validation set) to estimate M.'), the paper does not provide specific split percentages or sample counts for the training/validation/test sets to reproduce the data partitioning. |
| Hardware Specification | No | The paper mentions 'across 8 GPUs' but does not specify the model or type of GPUs or any other specific hardware components used for experiments. |
| Software Dependencies | No | The paper mentions 'We use Pytorch (Paszke et al., 2017)' but does not specify the version number of Pytorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | We train with SGD with a momentum of 0.9 and a weight decay of 10⁻⁴ for 90 epochs, using a batch size of 2048 across 8 GPUs. We use the waterfall schedule for the learning rate: it starts at 0.8 and is divided by 10 every 30 epochs. Radioactive data are generated by running SGD by optimizing Equation (5) with R = 10, λ1 = 0.0005 and λ2 = 0.01. |
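
The training hyperparameters quoted in the Experiment Setup row can be collected into a small configuration, with the "waterfall" learning-rate schedule written out as a step function. This is a minimal sketch, not the authors' code: the names `TrainConfig`, `learning_rate`, and `per_gpu_batch_size` are hypothetical, epochs are assumed to be 0-indexed, and the even split of the batch across GPUs is an assumption rather than something the paper states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row (field names are illustrative)."""
    momentum: float = 0.9
    weight_decay: float = 1e-4
    epochs: int = 90
    batch_size: int = 2048        # total batch size across all GPUs
    num_gpus: int = 8
    base_lr: float = 0.8          # starting learning rate
    lr_decay_factor: float = 0.1  # "divided by 10"
    lr_decay_every: int = 30      # epochs between decays


def learning_rate(epoch: int, cfg: TrainConfig = TrainConfig()) -> float:
    """Step schedule: base_lr scaled by 0.1 for every 30 completed epochs.

    Assumes 0-indexed epochs, so epochs 0-29 use 0.8, 30-59 use 0.08,
    and 60-89 use 0.008.
    """
    if not 0 <= epoch < cfg.epochs:
        raise ValueError(f"epoch must be in [0, {cfg.epochs}), got {epoch}")
    return cfg.base_lr * cfg.lr_decay_factor ** (epoch // cfg.lr_decay_every)


def per_gpu_batch_size(cfg: TrainConfig) -> int:
    """Assumption: the total batch is split evenly across GPUs."""
    return cfg.batch_size // cfg.num_gpus


if __name__ == "__main__":
    cfg = TrainConfig()
    # Print the learning rate at the first epoch of each stage.
    for epoch in (0, 30, 60):
        print(epoch, learning_rate(epoch, cfg))
```

In PyTorch, which the paper reports using, the same schedule would typically be expressed with `torch.optim.SGD` and `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`; the row does not say which scheduler the authors actually used.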