Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dataset Distillation via Curriculum Data Synthesis in Large Data Era

Authors: Zeyuan Yin, Zhiqiang Shen

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed framework achieves the current published highest accuracy on both large-scale ImageNet-1K and ImageNet-21K, with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, using a regular input resolution of 224×224 with faster convergence speed and less synthesis time. ... We conduct extensive experiments on the CIFAR, Tiny-ImageNet, ImageNet-1K, and ImageNet-21K datasets. Employing a resolution of 224×224 and IPC 50 on ImageNet-1K, the proposed approach attains an impressive accuracy of 63.2%, surpassing all prior state-of-the-art methods by substantial margins.
Researcher Affiliation | Academia | Zeyuan Yin (EMAIL), VILA Lab, Mohamed bin Zayed University of Artificial Intelligence; Zhiqiang Shen (EMAIL), VILA Lab, Mohamed bin Zayed University of Artificial Intelligence
Pseudocode | Yes | Algorithm 1: Our CDA via RandomResizedCrop. Input: squeezed model φθ, recovery iterations S, curriculum milestone T, target label y, default lower and upper crop-scale bounds βl and βu in RandomResizedCrop, decay of lower scale bound γ. Output: synthetic image x. Initialize x0 from a standard normal distribution. For step s from 0 to S−1: if s < T, then α ← βu (step), α ← βl + γ·βu·(1 − s/T) (linear), or α ← βl + γ·βu·(1 + cos(π·s/T))/2 (cosine); else α ← βl. x̃ ← RandomResizedCrop(xs, min_crop = α, max_crop = βu); x̃ is optimized w.r.t. φθ and y in Eq. 7; xs+1 ← ReverseRandomResizedCrop(xs, x̃). Return x ← xS.
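The curriculum schedule for the lower crop-scale bound α in the row above can be sketched as a small Python function. This is a minimal illustrative sketch, not the authors' code: the function name, the default bounds βl = 0.08 and βu = 1.0 (the standard RandomResizedCrop defaults), γ = 1.0, and the final clamp are all assumptions.

```python
import math

def lower_crop_bound(s, T, beta_l=0.08, beta_u=1.0, gamma=1.0, mode="cosine"):
    """Curriculum lower bound (alpha) for RandomResizedCrop's scale range.

    Starts near beta_u (large, easy crops) and decays toward beta_l
    (small, hard crops) over the first T recovery steps; after the
    curriculum milestone T it stays at beta_l.
    """
    if s >= T:
        return beta_l
    if mode == "step":
        alpha = beta_u
    elif mode == "linear":
        alpha = beta_l + gamma * beta_u * (1 - s / T)
    elif mode == "cosine":
        alpha = beta_l + gamma * beta_u * (1 + math.cos(math.pi * s / T)) / 2
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Clamp (an assumption) so the lower bound never exceeds beta_u.
    return min(beta_u, alpha)
```

At each recovery step the returned α would be passed as the minimum crop scale to RandomResizedCrop, so early steps see near-global crops and later steps see progressively smaller, harder ones.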
Open Source Code | Yes | Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.
Open Datasets | Yes | We verify the effectiveness of our approach on small-scale CIFAR-100 and various ImageNet-scale datasets, including Tiny-ImageNet (Le & Yang, 2015), ImageNet-1K (Deng et al., 2009), and ImageNet-21K (Ridnik et al., 2021).
Dataset Splits | Yes | We verify the effectiveness of our approach on small-scale CIFAR-100 and various ImageNet-scale datasets, including Tiny-ImageNet (Le & Yang, 2015), ImageNet-1K (Deng et al., 2009), and ImageNet-21K (Ridnik et al., 2021). For evaluation, we train models from scratch on the synthetic distilled datasets and report Top-1 accuracy on the real validation datasets. (Appendix B.1, B.2, B.3, and B.4 also refer to 'training data' and 'validation setting' for these standard datasets.)
Hardware Specification | Yes | Specifically, for ImageNet-1K, it takes about 29 hours to generate the distilled ImageNet-1K with 50 IPC on a single A100 (40G) GPU, with a peak GPU memory utilization of 6.7GB. For ImageNet-21K, it takes 11 hours per IPC to generate ImageNet-21K images on a single RTX 4090 GPU, with a peak GPU memory utilization of 15GB. In our experiment, it takes about 55 hours to generate the entire distilled ImageNet-21K with 20 IPC on 4 RTX 4090 GPUs in total.
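As a quick consistency check on the quoted wall-clock figures, the per-IPC and total ImageNet-21K numbers agree under the simplifying assumption of perfect parallel scaling across the 4 GPUs:

```python
# Quoted figures: 11 GPU-hours per IPC, 20 IPC total, 4 RTX 4090 GPUs.
hours_per_ipc = 11
ipc = 20
num_gpus = 4

total_gpu_hours = hours_per_ipc * ipc       # 220 GPU-hours of work
wall_clock = total_gpu_hours / num_gpus     # 55 hours, matching the quote
```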
Software Dependencies | No | The information is insufficient. The paper mentions 'PyTorch's pre-trained MobileNet-V2' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Appendix B Implementation Details... Table 10: Hyper-parameter settings on CIFAR-100. ... Table 12: Hyper-parameter settings on Tiny-ImageNet. ... Table 16: Hyper-parameter settings on ImageNet-1K. ... Table 20: Hyper-parameter settings on ImageNet-21K. ... We use a relatively large label smoothing of 0.2 together with Cutout (DeVries & Taylor, 2017) and RandAugment (Cubuk et al., 2020)...