Self-Supervised Dataset Distillation for Transfer Learning

Authors: Dong Bok Lee, Seanie Lee, Joonho Ko, Kenji Kawaguchi, Juho Lee, Sung Ju Hwang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the effectiveness of our method on transfer learning. Our code is available at https://github.com/db-Lee/selfsup_dd. We empirically show that our proposed KRR-ST significantly outperforms the supervised dataset distillation methods in transfer learning experiments, where we condense a source dataset, which is either CIFAR100 (Krizhevsky et al., 2009), Tiny ImageNet (Le & Yang, 2015), or ImageNet (Deng et al., 2009), into a small set, pre-train models with different architectures on the condensed dataset, and fine-tune all the models on target labeled datasets such as CIFAR10, Aircraft (Maji et al., 2013), Stanford Cars (Krause et al., 2013), CUB2011 (Wah et al., 2011), Stanford Dogs (Khosla et al., 2011), and Flowers (Nilsback & Zisserman, 2008).
Researcher Affiliation | Academia | Dong Bok Lee1, Seanie Lee1, Joonho Ko1, Kenji Kawaguchi2, Juho Lee1, Sung Ju Hwang1; KAIST1, National University of Singapore2; {markhi, lsnfamily02, joonho.ko, juholee, sjhwang82}@kaist.ac.kr, kenji@comp.nus.edu.sg
Pseudocode | Yes | Algorithm 1: Kernel Ridge Regression on Self-supervised Target (KRR-ST). (A minimal sketch of the KRR step appears after this table.)
Open Source Code | Yes | Our code is available at https://github.com/db-Lee/selfsup_dd.
Open Datasets | Yes | Datasets: We use either CIFAR100 (Krizhevsky et al., 2009), Tiny ImageNet (Le & Yang, 2015), or ImageNet (Deng et al., 2009) as a source dataset for dataset distillation, while evaluating the distilled dataset on CIFAR10 (Krizhevsky et al., 2009), Aircraft (Maji et al., 2013), Stanford Cars (Krause et al., 2013), CUB2011 (Wah et al., 2011), Stanford Dogs (Khosla et al., 2011), and Flowers (Nilsback & Zisserman, 2008). (A dataset-loading sketch appears after this table.)
Dataset Splits | No | The paper refers to a "target training dataset" and a "target test dataset" for common public datasets (CIFAR100, Tiny ImageNet, ImageNet, CIFAR10, etc.), implying standard splits. However, it does not state specific percentages or sample counts, nor does it explicitly cite predefined splits (e.g., "we use the standard train/validation/test split for CIFAR10").
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. It only mentions "GPUs" in the ethics statement, not in the experimental setup.
Software Dependencies | No | The paper states: "We use PyTorch (Paszke et al., 2019) to implement our self-supervised dataset distillation method, KRR-ST." While PyTorch is mentioned, no specific version of PyTorch or of other critical software dependencies (e.g., Python, CUDA, supporting libraries) is provided.
Experiment Setup | Yes | We choose the number of layers based on the resolution of images, i.e., 3 layers for 32×32 and 4 layers for 64×64, respectively. We initialize and maintain l = 10 models for the model pool M, and update the models in the pool using full-batch gradient descent with learning rate, momentum, and weight decay set to 0.1, 0.9, and 0.001, respectively. The total number of steps T is set to 1,000. We meta-update our distilled dataset for 160,000 iterations using the AdamW optimizer (Loshchilov & Hutter, 2019) with an initial learning rate of 0.001, 0.00001, and 0.00001 for CIFAR100, Tiny ImageNet, and ImageNet, respectively. The learning rate is linearly decayed. We use ResNet18 (He et al., 2016) as a self-supervised target model gϕ which is trained on a source dataset with the Barlow Twins (Zbontar et al., 2021) objective. After distillation, we pre-train a model on the distilled dataset for 1,000 epochs with a mini-batch size of 256 using the stochastic gradient descent (SGD) optimizer, where learning rate, momentum, and weight decay are set to 0.1, 0.9, and 0.001, respectively. For the baselines, we follow their original experimental setup to pre-train a model on their condensed dataset. For fine-tuning, all the experimental setups are fixed as follows: we use the SGD optimizer with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. We fine-tune the models for 10,000 iterations (CIFAR100, CIFAR10, and Tiny ImageNet) or 5,000 iterations (Aircraft, Cars, CUB2011, Dogs, and Flowers) with a mini-batch size of 256. The learning rate is decayed with cosine scheduling. (An optimizer-settings sketch appears after this table.)
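
Sketch for the Pseudocode row (Algorithm 1, KRR-ST): the snippet below illustrates, under a linear-kernel assumption, how a kernel ridge regression fit on a distilled set with learnable self-supervised targets can be turned into a loss against a teacher's representations of real source data. The function and variable names (krr_st_loss, f, x_syn, y_syn, x_real, y_teacher, lam) are illustrative placeholders, not the authors' implementation.

import torch

def krr_st_loss(f, x_syn, y_syn, x_real, y_teacher, lam=1e-6):
    # f:         feature extractor sampled from the model pool
    # x_syn:     distilled (synthetic) images, shape (m, C, H, W)
    # y_syn:     learnable target representations of the distilled images, shape (m, d)
    # x_real:    mini-batch of real source images, shape (n, C, H, W)
    # y_teacher: self-supervised teacher representations of x_real, shape (n, d)
    # lam:       ridge regularization coefficient
    z_syn = f(x_syn)                    # features of distilled images, (m, p)
    z_real = f(x_real)                  # features of real images, (n, p)

    # Closed-form kernel ridge regression on the distilled set with a linear
    # kernel K = Z Z^T: alpha = (K_ss + lam * I)^{-1} y_syn.
    k_ss = z_syn @ z_syn.t()            # (m, m)
    k_rs = z_real @ z_syn.t()           # (n, m)
    eye = torch.eye(k_ss.size(0), device=k_ss.device)
    alpha = torch.linalg.solve(k_ss + lam * eye, y_syn)

    # Predict representations for the real batch and compare them with the
    # self-supervised teacher's representations via mean squared error.
    pred = k_rs @ alpha                 # (n, d)
    return ((pred - y_teacher) ** 2).mean()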
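
Sketch for the Open Datasets row: a minimal example of loading one source/target pair with torchvision. Only CIFAR100 and CIFAR10 ship with torchvision; Tiny ImageNet, ImageNet, CUB2011, Stanford Dogs, and the other fine-grained datasets require separate downloads. The root path and transform are placeholders.

import torchvision
import torchvision.transforms as T

# Source dataset for distillation and one target dataset for fine-tuning
# (illustrative; paths and transforms are not taken from the paper).
transform = T.Compose([T.ToTensor()])

source_train = torchvision.datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
target_train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
target_test = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)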
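
Sketch for the Experiment Setup row: the helper below mirrors the quoted pre-training and fine-tuning optimizer settings in PyTorch. The function name build_optimizers, the model arguments, and the iteration count are placeholders; this is not the authors' training loop.

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizers(pretrain_model, finetune_model, finetune_iters=10_000):
    # Pre-training on the distilled dataset: SGD with learning rate 0.1,
    # momentum 0.9, and weight decay 0.001 (as quoted above).
    pretrain_opt = torch.optim.SGD(pretrain_model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.001)

    # Fine-tuning on a target dataset: SGD with learning rate 0.01, momentum 0.9,
    # weight decay 0.0005, and cosine learning-rate decay over 10,000 iterations
    # (5,000 for the smaller target datasets).
    finetune_opt = torch.optim.SGD(finetune_model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
    finetune_sched = CosineAnnealingLR(finetune_opt, T_max=finetune_iters)
    return pretrain_opt, finetune_opt, finetune_sched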