What is Dataset Distillation Learning?
Authors: William Yang, Ye Zhu, Zhiwei Deng, Olga Russakovsky
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide a framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. Experimental setup. We leverage the CIFAR-10 (Krizhevsky et al., 2009) dataset for our analysis with additional analysis on other datasets in Section B of the appendix. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Princeton University, Princeton NJ, United States 2Google Research, Mountain View CA, United States. |
| Pseudocode | No | The paper describes various dataset distillation algorithms (BPTT, distribution matching, gradient matching, trajectory matching) with mathematical formulations, but it does not include any formal pseudocode blocks or sections explicitly labeled 'Algorithm'. |
| Open Source Code | Yes | https://github.com/princetonvisualai/What-is-Dataset-Distillation-Learning |
| Open Datasets | Yes | We leverage the CIFAR-10 (Krizhevsky et al., 2009) dataset for our analysis with additional analysis on other datasets in Section B of the appendix. We extend our analysis on mixing with real data and the model predictions analysis to the CIFAR-100 dataset (Krizhevsky et al., 2009). Similarly, we extend mixture experiments and our model prediction analysis to the Tiny ImageNet dataset (Le & Yang, 2015). |
| Dataset Splits | No | The paper discusses training on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets and evaluating on test sets, but it does not explicitly provide details about training/validation/test splits (e.g., percentages, sample counts, or specific predefined validation sets) needed to reproduce data partitioning for validation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions software components such as 'SGD optimizer', 'PyHessian', and 'LLaVA', but it does not provide specific version numbers for these or any other software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | We use the standard three-layer-deep, 128-filter-wide convolutional neural network to train on distilled data and real data with a 0.01 learning rate and 0.9 momentum for 300 iterations using the SGD optimizer. |
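The "Experiment Setup" row reports plain momentum SGD (learning rate 0.01, momentum 0.9, 300 iterations). A minimal sketch of that optimizer loop, assuming NumPy; the quadratic objective and the function name `sgd_momentum` are illustrative stand-ins, not the paper's ConvNet or code:

```python
import numpy as np

def sgd_momentum(grad_fn, w0, lr=0.01, momentum=0.9, iters=300):
    """Momentum-SGD loop with the hyperparameters reported in the paper.

    grad_fn: callable returning the gradient of the loss at w.
    w0: initial parameter vector.
    """
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)  # velocity buffer for the momentum term
    for _ in range(iters):
        v = momentum * v - lr * grad_fn(w)  # accumulate decayed gradients
        w = w + v                           # take the momentum step
    return w

# Placeholder objective: minimize ||w - 3||^2, whose gradient is 2*(w - 3).
w_star = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=[0.0, 10.0])
```

In the paper's actual setting, `grad_fn` would be the cross-entropy gradient of the three-layer ConvNet on a batch of distilled or real CIFAR-10 images; only the update rule is shown here.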
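The "Pseudocode" row notes that the paper describes several distillation algorithms (BPTT, distribution matching, gradient matching, trajectory matching) only mathematically. As one concrete instance, gradient matching minimizes a layer-wise distance between gradients computed on real and distilled batches; a common choice in that literature is summed cosine distance. A minimal NumPy sketch, assuming that distance; the function name `gradient_match_distance` is hypothetical, not from the paper's repository:

```python
import numpy as np

def gradient_match_distance(real_grads, syn_grads, eps=1e-12):
    """Sum of per-layer cosine distances between two gradient sets.

    real_grads / syn_grads: lists of same-shaped arrays, one per layer,
    holding gradients of the task loss on real vs. distilled batches.
    Returns 0 when the gradients align perfectly, 2 per layer when opposed.
    """
    total = 0.0
    for g_r, g_s in zip(real_grads, syn_grads):
        g_r, g_s = g_r.ravel(), g_s.ravel()
        cos = g_r @ g_s / (np.linalg.norm(g_r) * np.linalg.norm(g_s) + eps)
        total += 1.0 - cos
    return total

# Two toy "layers" of gradients; identical sets give ~0 distance.
g = [np.array([1.0, 2.0]), np.array([[0.5, -1.0]])]
d_same = gradient_match_distance(g, g)
d_opposite = gradient_match_distance(g, [-x for x in g])
```

The distilled images are then updated by differentiating this distance with respect to the synthetic pixels, which is where the methods above diverge in their details.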