Scaffolding a Student to Instill Knowledge
Authors: Anil Kag, Durmus Alp Emre Acar, Aditya Gangrade, Venkatesh Saligrama
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show on synthetic examples that censoring hard examples smooths the student’s loss landscape so that the student encounters fewer local minima. As a result, it has good generalization properties. Against vanilla KD, we achieve improved performance and are comparable to more intrusive techniques that leverage feature matching on benchmark datasets. |
| Researcher Affiliation | Academia | ECE Department, Boston University, Boston, MA; Statistics Department, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Algorithm 1 DiSK: Distilling Selective Knowledge. |
| Open Source Code | Yes | We avail our code at https://github.com/anilkagak2/DiSK_Distilling_Scaffolded_Knowledge |
| Open Datasets | Yes | We use publicly available CIFAR-100 (Krizhevsky, 2009) and Tiny-Imagenet (Le & Yang, 2015) datasets. CIFAR-100 contains 50K training and 10K test images from 100 classes of size 32 × 32 × 3, while Tiny-Imagenet contains 100K training and 10K test images from 200 classes of size 64 × 64 × 3. (See the data-loading sketch after the table.) |
| Dataset Splits | Yes | We draw an independent validation set of 100 data points for hyper-parameter tuning. |
| Hardware Specification | No | The paper discusses computational requirements (MACs) and parameters of models (Table 2, Table 5, Table 6) and talks about 'saving resources' and 'larger compute resources', but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using SGD as the optimizer and refers to the 'timm' library (Wightman, 2019) for models. However, it does not specify version numbers for any software dependencies like Python, PyTorch, or the timm library itself, which are necessary for reproducible setup. |
| Experiment Setup | Yes | For each method, we train models for 200 epochs using SGD as the optimizer with 0.9 momentum and 0.1 learning rate. See Appx. A.5 for more training details. We use 200 as the batch size. For both KD and DiSK, we scan the α hyper-parameter over the range {0.0, 0.1, 0.5, 0.9, 1.0}. As per recommendations from previous works (Chen et al., 2022; Cho & Hariharan, 2019; Tung & Mori, 2019), we use τ = 4 as the temperature in Eq. 1. For DiSK, we scan the different hyper-parameters in the following ranges: (a) τs ∈ {1, 2, 4}, (b) K ∈ {1, 3, 5, 10, 20, 50}, (c) λmin ∈ {0.01, 0.1, 1, 5, 10}, (d) λmax ∈ {1, 5, 10, 20, 50, 100, 1000}, (e) budget δ within 0.2 distance of the cross-entropy-trained student model's error, and (f) λT ∈ {20, 50}. We replace the arg min in Algorithm 1 with three SGD steps over the entire dataset. (A configuration sketch follows the table.) |
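
As a reproduction aid for the data setup reported in the Open Datasets and Dataset Splits rows, below is a minimal sketch, assuming torchvision is used for CIFAR-100 (the paper does not name its data pipeline) and that the 100-point validation set is drawn uniformly at random from the training set; the augmentations and the random seed are assumptions, not taken from the paper.

```python
# Minimal CIFAR-100 data setup sketch (assumptions noted above).
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Assumed augmentations; the paper does not specify its transforms.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR100(root="./data", train=False, download=True,
                             transform=transforms.ToTensor())

# Independent validation set of 100 data points for hyper-parameter tuning (as reported).
gen = torch.Generator().manual_seed(0)            # seed is an assumption
perm = torch.randperm(len(train_set), generator=gen)
val_set = Subset(train_set, perm[:100].tolist())
train_subset = Subset(train_set, perm[100:].tolist())

# Batch size 200, as stated in the Experiment Setup row.
train_loader = DataLoader(train_subset, batch_size=200, shuffle=True)
val_loader = DataLoader(val_set, batch_size=100)
test_loader = DataLoader(test_set, batch_size=200)
```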
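The Experiment Setup row can likewise be summarized as code. The sketch below uses the reported optimizer settings (SGD, lr 0.1, momentum 0.9), the temperature τ = 4, and the scanned hyper-parameter grids. The exact form of Eq. 1 is not reproduced in this table, so the loss below is the standard Hinton-style KD objective and may differ in detail from the paper's; `student` and `teacher` are placeholders for the architectures used in the paper.

```python
# Hedged sketch of the reported training configuration, not the authors' implementation.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha, tau=4.0):
    """Standard KD objective: (1 - alpha) * CE + alpha * tau^2 * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.log_softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
        log_target=True,
    )
    return (1.0 - alpha) * ce + alpha * (tau ** 2) * kl

# Reported optimizer settings; `student` is a placeholder model.
# optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
# Training runs for 200 epochs with batch size 200.

# Hyper-parameter grids scanned in the paper (copied from the setup row).
# The budget delta is set within 0.2 of the cross-entropy-trained student's error,
# so it is model- and dataset-specific and omitted here.
disk_grid = {
    "alpha":   [0.0, 0.1, 0.5, 0.9, 1.0],
    "tau_s":   [1, 2, 4],
    "K":       [1, 3, 5, 10, 20, 50],
    "lam_min": [0.01, 0.1, 1, 5, 10],
    "lam_max": [1, 5, 10, 20, 50, 100, 1000],
    "lam_T":   [20, 50],
}
```

Per the setup row, the arg min step of Algorithm 1 is approximated by three SGD steps over the entire dataset rather than solved exactly.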