Learning to Scaffold: Optimizing Model Explanations for Teaching

Authors: Patrick Fernandes, Marcos Treviso, Danish Pruthi, André Martins, Graham Neubig

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train models on three natural language processing and computer vision tasks, and find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than ones produced with previous methods.
Researcher Affiliation | Collaboration | Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA; Instituto Superior Técnico & LUMLIS (Lisbon ELLIS Unit), Lisbon, Portugal; Instituto de Telecomunicações, Lisbon, Portugal; Amazon Web Services; Unbabel, Lisbon, Portugal
Pseudocode | No | The paper describes its optimization process and framework using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code | Yes | Our code is available at https://github.com/coderpat/learning-scaffold.
Open Datasets | Yes | For text classification, we consider the IMDB dataset [Maas et al., 2011]... we consider image classification on the CIFAR-100 dataset [Krizhevsky, 2009]... We use the MLQE-PE dataset [Fomicheva et al., 2020]
Dataset Splits | Yes | We split the original CIFAR-100 training set into a new training set with 45,000 examples and a validation set with 5,000. (An illustrative split sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for running the experiments. It mentions using software libraries like JAX, Huggingface Transformers, and Flax, but no underlying hardware specifications.
Software Dependencies | No | The paper cites key software components, namely JAX [Bradbury et al., 2018], the Huggingface Transformers library [Wolf et al., 2020], and Flax [Heek et al., 2020], but specific version numbers (e.g., JAX vX.Y.Z) are not provided, which would be necessary for full reproducibility. (A version-recording sketch appears after the table.)
Experiment Setup | Yes | For each task, we train a teacher model with AdamW [Loshchilov and Hutter, 2019] but, as explained in Section 3, we use SGD for the student model (inner loop). We also use scalar mixing [Peters et al., 2018] to pool representations from different layers automatically... We use the Kullback-Leibler divergence as L_expl, and we set β = 5 for attention-based explainers and β = 0.2 for gradient-based explainers (since we found smaller values to be better). We set L_sim as the cross-entropy loss for classification tasks, and as the mean squared error loss for text regression. (A hedged sketch of this loss configuration appears after the table.)
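
The CIFAR-100 split quoted in the Dataset Splits row can be reproduced in spirit with a simple index permutation. The sketch below is illustrative only: the paper does not state the random seed or the tooling used, so both are assumptions.

import numpy as np

# Illustrative 45,000 / 5,000 split of the CIFAR-100 training set.
# The seed and the use of NumPy are assumptions; the paper does not specify either.
rng = np.random.default_rng(0)
indices = rng.permutation(50_000)          # CIFAR-100 ships 50,000 training images
train_idx, valid_idx = indices[:45_000], indices[45_000:]
print(len(train_idx), len(valid_idx))      # 45000 5000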
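
Since the paper names its libraries but not their versions, a practical workaround when re-running the released code is to record the versions installed locally. This is a generic snippet, not something taken from the authors' repository.

import jax
import flax
import transformers

# Record the library versions actually used, since the paper does not pin them.
print("jax:", jax.__version__)
print("flax:", flax.__version__)
print("transformers:", transformers.__version__)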
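
To make the Experiment Setup description concrete, here is a minimal JAX sketch of the student objective it describes: a simulation loss plus a β-weighted Kullback-Leibler explanation loss. Function names, array shapes, and the assumption of hard teacher predictions are illustrative choices, not the authors' released implementation.

import jax
import jax.numpy as jnp

def kl_divergence(p, q, eps=1e-8):
    # L_expl: KL(teacher explanation || student explanation), with both explanations
    # assumed to be probability distributions over input tokens or image patches.
    return jnp.sum(p * (jnp.log(p + eps) - jnp.log(q + eps)), axis=-1)

def cross_entropy(student_logits, teacher_preds):
    # L_sim for classification: the student should reproduce the teacher's predictions.
    log_probs = jax.nn.log_softmax(student_logits, axis=-1)
    return -jnp.take_along_axis(log_probs, teacher_preds[:, None], axis=-1).squeeze(-1)

def student_loss(student_logits, teacher_preds, student_expl, teacher_expl, beta=5.0):
    # Combined objective: L_sim + beta * L_expl, with beta = 5 for attention-based
    # explainers and beta = 0.2 for gradient-based ones, as reported in the paper.
    l_sim = cross_entropy(student_logits, teacher_preds)
    l_expl = kl_divergence(teacher_expl, student_expl)
    return jnp.mean(l_sim + beta * l_expl)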