Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?
Authors: Lingao Xiao, Yang He
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40× compression) with a 2.6% performance gain. Code is available at: https://github.com/he-y/soft-label-pruning-for-dataset-distillation. |
| Researcher Affiliation | Collaboration | 1. CFAR, Agency for Science, Technology and Research, Singapore; 2. IHPC, Agency for Science, Technology and Research, Singapore; 3. National University of Singapore |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Procedures are described in text or illustrated with diagrams (e.g., Figure 4), but not in a pseudocode format. |
| Open Source Code | Yes | Code is available at: https://github.com/he-y/soft-label-pruning-for-dataset-distillation. |
| Open Datasets | Yes | Dataset. Our experiment results are evaluated on Tiny-ImageNet [33], ImageNet-1K [34], and ImageNet-21K-P [35]. [...] Tiny-ImageNet [33] is the subset of ImageNet-1K containing 500 images per class of a total of 200 classes, and spatial sizes of images are downsampled to 64×64. ImageNet-1K [34] contains 1,000 classes and 1,281,167 images in total. The image sizes are resized to 224×224. ImageNet-21K-P [35] is the pruned version of ImageNet-21K, containing 10,450 classes and 11,060,223 images in total. Images are sized to 224×224 resolution. |
| Dataset Splits | No | The paper mentions using well-known datasets and adhering to previous preprocessing/validation settings (e.g., "For validation, we adhere to the hyperparameter settings of CDA [7]"), but it does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or counts) within its text. |
| Hardware Specification | Yes | Experiments are performed on 4 A100 80G GPU cards. |
| Software Dependencies | No | The paper mentions using "PyTorch pretrained ResNet-18" and "Timm pretrained model" but does not specify version numbers for PyTorch, Timm, or any other critical software dependencies. |
| Experiment Setup | Yes | Appendix C provides detailed hyperparameter settings in tables for different phases and datasets, such as Table 11 "Data Synthesis of ImageNet-1K" which lists "Iteration 4,000", "Optimizer Adam", "Image LR 0.25", "Batch Size IPC-dependent", and "BN Loss (α) 0.01". |
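
The Research Type row cites a reduction of soft-label storage from 113 GB to 2.8 GB; a minimal arithmetic sketch of the implied ratio, taking only the two reported sizes from the paper:

```python
# Sanity check of the reported soft-label compression for ImageNet-1K
# at 200 images per class; both sizes are quoted from the paper.
original_gb = 113.0  # full soft-label storage
pruned_gb = 2.8      # pruned soft-label storage

ratio = original_gb / pruned_gb
print(f"compression ratio ~ {ratio:.1f}x")  # ~40.4x, consistent with the reported 40x
```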
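
The Open Datasets row lists 64×64 inputs for Tiny-ImageNet and 224×224 for ImageNet-1K and ImageNet-21K-P. Below is a hedged torchvision-style preprocessing sketch that only encodes those resolutions; the paper's actual augmentation and normalization pipeline may differ.

```python
from torchvision import transforms

# Illustrative resize-only pipelines for the resolutions quoted above;
# the paper's real pipeline (augmentation, normalization) is not specified here.
tiny_imagenet_tf = transforms.Compose([
    transforms.Resize((64, 64)),    # Tiny-ImageNet
    transforms.ToTensor(),
])
imagenet_tf = transforms.Compose([
    transforms.Resize((224, 224)),  # ImageNet-1K and ImageNet-21K-P
    transforms.ToTensor(),
])
```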
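
The Experiment Setup row quotes several values from Table 11 ("Data Synthesis of ImageNet-1K"); the sketch below restates them as a plain configuration dictionary. The key names and the `make_image_optimizer` helper are illustrative assumptions rather than the repository's actual API; only the numeric values come from the paper.

```python
import torch

# Hypothetical config mirroring the quoted Table 11 entries; key names are illustrative.
synthesis_cfg = {
    "iterations": 4_000,    # "Iteration 4,000"
    "optimizer": "adam",    # "Optimizer Adam"
    "image_lr": 0.25,       # "Image LR 0.25"
    "batch_size": None,     # "Batch Size IPC-dependent" (chosen per images-per-class)
    "bn_loss_alpha": 0.01,  # "BN Loss (α) 0.01"
}

def make_image_optimizer(synthetic_images: torch.Tensor) -> torch.optim.Optimizer:
    """Illustrative helper: Adam over the synthetic images at the reported learning rate."""
    synthetic_images.requires_grad_(True)
    return torch.optim.Adam([synthetic_images], lr=synthesis_cfg["image_lr"])
```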