Scaffolding a Student to Instill Knowledge

Authors: Anil Kag, Durmus Alp Emre Acar, Aditya Gangrade, Venkatesh Saligrama

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show on synthetic examples that censoring hard examples smooths the student's loss landscape so that the student encounters fewer local minima. As a result, it has good generalization properties. Against vanilla KD, we achieve improved performance and are comparable to more intrusive techniques that leverage feature matching on benchmark datasets.
Researcher Affiliation | Academia | ECE Department, Boston University, Boston, MA; Statistics Department, Carnegie Mellon University, Pittsburgh, PA
Pseudocode | Yes | Algorithm 1 DiSK: Distilling Selective Knowledge.
Open Source Code | Yes | We avail our code at https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge
Open Datasets | Yes | We use the publicly available CIFAR-100 (Krizhevsky, 2009) and Tiny-Imagenet (Le & Yang, 2015) datasets. CIFAR-100 contains 50K training and 10K test images from 100 classes with size 32 × 32 × 3, while Tiny-Imagenet contains 100K training and 10K test images from 200 classes with size 64 × 64 × 3. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | We draw an independent validation set of 100 data points for hyper-parameter tuning.
Hardware Specification | No | The paper discusses the computational requirements (MACs) and parameter counts of models (Tables 2, 5, and 6) and mentions 'saving resources' and 'larger compute resources', but it does not specify the hardware (e.g., GPU models, CPU types, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper mentions using SGD as the optimizer and refers to the 'timm' library (Wightman, 2019) for models. However, it does not specify version numbers for any software dependencies, such as Python, PyTorch, or the timm library itself, which are necessary for a reproducible setup.
Experiment Setup | Yes | For each method, we train models for 200 epochs using SGD as the optimizer with 0.9 momentum and 0.1 learning rate; see Appx. A.5 for more training details. We use a batch size of 200. For both KD and DiSK, we scan the α hyper-parameter over the range {0.0, 0.1, 0.5, 0.9, 1.0}. Per recommendations from previous works (Chen et al., 2022; Cho & Hariharan, 2019; Tung & Mori, 2019), we use τ = 4 as the temperature in Eq. 1. For DiSK, we scan the remaining hyper-parameters over the following ranges: (a) τs ∈ {1, 2, 4}, (b) K ∈ {1, 3, 5, 10, 20, 50}, (c) λmin ∈ {0.01, 0.1, 1, 5, 10}, (d) λmax ∈ {1, 5, 10, 20, 50, 100, 1000}, (e) budget δ within 0.2 of the cross-entropy-trained student model's error, and (f) λT ∈ {20, 50}. We replace the arg min in Algorithm 1 with three SGD steps over the entire dataset. (A configuration sketch follows the table.)
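
The Open Datasets and Dataset Splits rows describe the data used. Below is a minimal loading sketch, assuming PyTorch/torchvision; the local Tiny-Imagenet directory path and the seeded `random_split` used to carve out the 100-point validation set are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the dataset setup, assuming PyTorch/torchvision. The
# Tiny-Imagenet directory path is a hypothetical local download, and the
# seeded random_split for the 100-point validation set is an assumption.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# CIFAR-100: 50K train / 10K test images, 32x32x3, 100 classes.
cifar_train = datasets.CIFAR100("data", train=True, download=True, transform=to_tensor)
cifar_test = datasets.CIFAR100("data", train=False, download=True, transform=to_tensor)

# Tiny-Imagenet: 100K train / 10K test images, 64x64x3, 200 classes; assumed
# to be pre-downloaded into an ImageFolder-compatible layout.
tiny_train = datasets.ImageFolder("data/tiny-imagenet-200/train", transform=to_tensor)
tiny_test = datasets.ImageFolder("data/tiny-imagenet-200/val", transform=to_tensor)

# Independent validation set of 100 data points for hyper-parameter tuning,
# as described in the Dataset Splits row.
val_set, train_set = random_split(
    cifar_train, [100, len(cifar_train) - 100],
    generator=torch.Generator().manual_seed(0),
)
```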
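
The Experiment Setup row reports a standard SGD configuration together with the temperature-scaled KD objective of Eq. 1. The sketch below restates that configuration, again assuming PyTorch; `kd_loss`, `student`, `teacher`, and `search_space` are illustrative names, only the quoted numerical values come from the paper, and the DiSK-specific hyper-parameters are listed for reference without reproducing their use inside Algorithm 1.

```python
# Minimal sketch of the reported training configuration and the standard
# temperature-scaled KD objective (Eq. 1), assuming PyTorch. `student` and
# `teacher` are placeholders; only the numeric values are taken from the paper.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha, tau=4.0):
    """Cross-entropy plus temperature-scaled KL divergence to the teacher."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    return (1.0 - alpha) * ce + alpha * kl

# Reported optimizer settings: SGD, lr 0.1, momentum 0.9, 200 epochs, batch size 200.
# student = ...  # e.g. a timm model (placeholder)
# optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

# Hyper-parameter grids quoted from the Experiment Setup row; DiSK's budget
# delta is additionally chosen within 0.2 of the cross-entropy-trained
# student's error.
search_space = {
    "alpha":      [0.0, 0.1, 0.5, 0.9, 1.0],  # KD / DiSK mixing weight
    "tau_s":      [1, 2, 4],
    "K":          [1, 3, 5, 10, 20, 50],
    "lambda_min": [0.01, 0.1, 1, 5, 10],
    "lambda_max": [1, 5, 10, 20, 50, 100, 1000],
    "lambda_T":   [20, 50],
}
```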