On the Size and Approximation Error of Distilled Datasets

Authors: Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, Daniela Rus

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our bounds analytically and empirically. ... (Sec. 5, Experimental Study) To validate our theoretical bounds, we performed distillation on three datasets: two synthetic datasets ... and one real dataset (MNIST) for binary and multi-class classification. Full experimental details for all experiments are available in the appendix.
Researcher Affiliation | Collaboration | Alaa Maalouf (MIT CSAIL), Murad Tukan (DataHeroes), Noel Loo (MIT CSAIL), Ramin Hasani (MIT CSAIL), Mathias Lechner (MIT CSAIL), Daniela Rus (MIT CSAIL)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available in the supplementary material.
Open Datasets | Yes | For our next test, we first consider binary classification on (i) MNIST 0 and 1 digits, (ii) SVHN 0 and 1 digits, and (iii) CIFAR-10 ship vs. deer, all with labels −1 and +1, respectively. (A construction sketch follows the table.)
Dataset Splits | No | The paper mentions using standard datasets such as MNIST, SVHN, and CIFAR-10, which have predefined splits. However, it does not explicitly state the training, validation, or test split percentages or sample counts, nor does it cite the specific predefined splits used, as this criterion requires.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU/CPU models or cloud instance types.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | We distill for 20000 iterations with the Adam optimizer and a learning rate of 0.002, optimizing both the images/data positions and the labels. We use full-batch gradient descent for the synthetic datasets and a maximum batch size of 2000 for the MNIST experiment. For the MNIST experiment we found that, particularly for larger values of n, minibatch training could reach lower distillation losses when optimized for longer, so the closing of the gap between the upper bound and the experimental values in Fig. 4 may be misleading: longer optimization could bring the actual distillation loss lower. We fix λ = 10⁻⁵ and distill down to s = d_λ^k log d_λ^k points. We use a squared exponential kernel with lengthscale parameter l = 1.5: k(x, x') = exp(−‖x − x'‖₂² / (2l²)). We then sample y ∼ N(0, K_XX + σ_y² I_n), with σ_y = 0.01. (Hedged sketches of the synthetic-data generation and the distillation loop follow the table.)
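
The binary subsets quoted in the Open Datasets row (e.g., MNIST digits 0 vs. 1 with labels −1 and +1) can be rebuilt from the standard torchvision downloads. The snippet below is a minimal sketch of that construction, not the authors' released code; the helper name binary_subset, the transform, and the data root are assumptions for illustration.

```python
import torch
from torchvision import datasets, transforms

def binary_subset(dataset, neg_class=0, pos_class=1):
    """Keep only two classes and relabel them as -1 / +1 (hypothetical helper)."""
    xs, ys = [], []
    for x, y in dataset:
        if y in (neg_class, pos_class):
            xs.append(x)
            ys.append(-1.0 if y == neg_class else 1.0)
    return torch.stack(xs), torch.tensor(ys)

# MNIST 0-vs-1, matching the quote; SVHN and CIFAR-10 subsets follow the same pattern.
mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
X, y = binary_subset(mnist, neg_class=0, pos_class=1)
```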
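
The synthetic-data recipe in the Experiment Setup row (squared exponential kernel with l = 1.5, labels drawn from the Gaussian-process prior N(0, K_XX + σ_y² I_n) with σ_y = 0.01) can be sketched directly. The input distribution and the sizes n and d below are assumptions; the paper's exact synthetic datasets are specified in its appendix.

```python
import torch

torch.manual_seed(0)
n, d = 500, 10           # assumed sizes; the paper's exact values are in its appendix
l, sigma_y = 1.5, 0.01   # lengthscale and label-noise std quoted above

X = torch.randn(n, d)    # assumed input distribution (not specified in the quote)

# Squared exponential kernel: k(x, x') = exp(-||x - x'||_2^2 / (2 l^2))
K = torch.exp(-torch.cdist(X, X) ** 2 / (2 * l ** 2))

# Labels from the GP prior with observation noise: y ~ N(0, K_XX + sigma_y^2 I_n)
cov = K + sigma_y ** 2 * torch.eye(n)
y = torch.distributions.MultivariateNormal(torch.zeros(n), covariance_matrix=cov).sample()
```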
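
The same row fixes the optimizer (Adam, learning rate 0.002, 20000 iterations), the ridge parameter λ = 10⁻⁵, and the distilled size s = d_λ^k log d_λ^k, but not the exact training objective. Below is one plausible kernel-ridge-regression distillation loop consistent with that description, optimizing both the distilled positions and labels; it is a sketch under those assumptions, not the authors' implementation (their code ships with the supplementary material).

```python
import torch

def se_kernel(A, B, l=1.5):
    # Squared exponential kernel matrix between the rows of A and B.
    return torch.exp(-torch.cdist(A, B) ** 2 / (2 * l ** 2))

n, s, d = 2000, 64, 10        # assumed sizes; s would be set to d_lambda^k * log(d_lambda^k)
lam = 1e-5                    # ridge parameter quoted above
X, y = torch.randn(n, d), torch.randn(n, 1)   # stand-ins for the full training set

# Distilled positions and labels, both trainable as described in the quote.
Xs = torch.randn(s, d, requires_grad=True)
ys = torch.zeros(s, 1, requires_grad=True)

opt = torch.optim.Adam([Xs, ys], lr=2e-3)     # lr = 0.002 as quoted
for _ in range(20000):                        # 20000 iterations as quoted
    opt.zero_grad()
    alpha = torch.linalg.solve(se_kernel(Xs, Xs) + lam * torch.eye(s), ys)
    preds = se_kernel(X, Xs) @ alpha          # KRR fit on the distilled set, evaluated on X
    loss = ((preds - y) ** 2).mean()          # approximation error on the full data
    loss.backward()
    opt.step()
```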