Scaffolding a Student to Instill Knowledge
Authors: Anil Kag, Durmus Alp Emre Acar, Aditya Gangrade, Venkatesh Saligrama
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show on synthetic examples that censoring hard examples smooths the student’s loss landscape so that the student encounters fewer local minima. As a result, it has good generalization properties. Against vanilla KD, we achieve improved performance and are comparable to more intrusive techniques that leverage feature matching on benchmark datasets. |
| Researcher Affiliation | Academia | ECE Department, Boston University, Boston, MA; Statistics Department, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Algorithm 1 DiSK: Distilling Selective Knowledge. |
| Open Source Code | Yes | We avail our code at https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge |
| Open Datasets | Yes | We use publicly available CIFAR-100 (Krizhevsky, 2009) and Tiny-Imagenet (Le & Yang, 2015) datasets. CIFAR-100 contains 50K training and 10K test images from 100 classes of size 32 × 32 × 3, while Tiny-Imagenet contains 100K training and 10K test images from 200 classes of size 64 × 64 × 3. (See the data-loading sketch after the table.) |
| Dataset Splits | Yes | We draw an independent validation set of 100 data points for hyper-parameter tuning. |
| Hardware Specification | No | The paper discusses computational requirements (MACs) and parameters of models (Table 2, Table 5, Table 6) and talks about 'saving resources' and 'larger compute resources', but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using SGD as the optimizer and refers to the 'timm' library (Wightman, 2019) for models. However, it does not specify version numbers for any software dependencies like Python, PyTorch, or the timm library itself, which are necessary for reproducible setup. |
| Experiment Setup | Yes | For each method, we train models for 200 epochs using SGD as the optimizer with 0.9 momentum and 0.1 learning rate. See Appx. A.5 for more training details. We use 200 as the batch size. For both KD and DiSK, we scan the α hyper-parameter over the range {0.0, 0.1, 0.5, 0.9, 1.0}. As per recommendations from previous works (Chen et al., 2022; Cho & Hariharan, 2019; Tung & Mori, 2019), we use τ = 4 as the temperature in Eq. 1. For DiSK, we scan the different hyper-parameters in the following ranges: (a) τs ∈ {1, 2, 4}, (b) K ∈ {1, 3, 5, 10, 20, 50}, (c) λmin ∈ {0.01, 0.1, 1, 5, 10}, (d) λmax ∈ {1, 5, 10, 20, 50, 100, 1000}, (e) budget δ within 0.2 distance of the cross-entropy-trained student model's error, and (f) λT ∈ {20, 50}. We replace the arg min in Algorithm 1 with three SGD steps over the entire dataset. (A configuration sketch follows the table.) |
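
As a reproduction aid for the data setup reported in the Open Datasets and Dataset Splits rows, below is a minimal sketch, assuming torchvision is used for CIFAR-100 (the paper does not name its data pipeline) and that the 100-point validation set is drawn uniformly at random from the training set; the augmentations and the random seed are assumptions, not taken from the paper.

```python
# Minimal CIFAR-100 data setup sketch (assumptions noted above).
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Assumed augmentations; the paper does not specify its transforms.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR100(root="./data", train=False, download=True,
                             transform=transforms.ToTensor())

# Independent validation set of 100 data points for hyper-parameter tuning (as reported).
gen = torch.Generator().manual_seed(0)            # seed is an assumption
perm = torch.randperm(len(train_set), generator=gen)
val_set = Subset(train_set, perm[:100].tolist())
train_subset = Subset(train_set, perm[100:].tolist())

# Batch size 200, as stated in the Experiment Setup row.
train_loader = DataLoader(train_subset, batch_size=200, shuffle=True)
val_loader = DataLoader(val_set, batch_size=100)
test_loader = DataLoader(test_set, batch_size=200)
```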
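The Experiment Setup row can likewise be summarized as code. The sketch below uses the reported optimizer settings (SGD, lr 0.1, momentum 0.9), the temperature τ = 4, and the scanned hyper-parameter grids. The exact form of Eq. 1 is not reproduced in this table, so the loss below is the standard Hinton-style KD objective and may differ in detail from the paper's; `student` and `teacher` are placeholders for the architectures used in the paper.

```python
# Hedged sketch of the reported training configuration, not the authors' implementation.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha, tau=4.0):
    """Standard KD objective: (1 - alpha) * CE + alpha * tau^2 * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.log_softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
        log_target=True,
    )
    return (1.0 - alpha) * ce + alpha * (tau ** 2) * kl

# Reported optimizer settings; `student` is a placeholder model.
# optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
# Training runs for 200 epochs with batch size 200.

# Hyper-parameter grids scanned in the paper (copied from the setup row).
# The budget delta is set within 0.2 of the cross-entropy-trained student's error,
# so it is model- and dataset-specific and omitted here.
disk_grid = {
    "alpha":   [0.0, 0.1, 0.5, 0.9, 1.0],
    "tau_s":   [1, 2, 4],
    "K":       [1, 3, 5, 10, 20, 50],
    "lam_min": [0.01, 0.1, 1, 5, 10],
    "lam_max": [1, 5, 10, 20, 50, 100, 1000],
    "lam_T":   [20, 50],
}
```

Per the setup row, the arg min step of Algorithm 1 is approximated by three SGD steps over the entire dataset rather than solved exactly.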