Learning to Scaffold: Optimizing Model Explanations for Teaching
Authors: Patrick Fernandes, Marcos Treviso, Danish Pruthi, André Martins, Graham Neubig
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train models on three natural language processing and computer vision tasks, and find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than ones produced with previous methods. |
| Researcher Affiliation | Collaboration | Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA; Instituto Superior Técnico & LUMLIS (Lisbon ELLIS Unit), Lisbon, Portugal; Instituto de Telecomunicações, Lisbon, Portugal; Amazon Web Services; Unbabel, Lisbon, Portugal |
| Pseudocode | No | The paper describes its optimization process and framework using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our code is available at https://github.com/coderpat/learning-scaffold. |
| Open Datasets | Yes | For text classification, we consider the IMDB dataset [Maas et al., 2011]... we consider image classification on the CIFAR-100 dataset [Krizhevsky, 2009]... We use the MLQE-PE dataset [Fomicheva et al., 2020] |
| Dataset Splits | Yes | We split the original CIFAR-100 training set into a new training set with 45,000 examples and a validation set with 5,000 examples. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for running the experiments. It mentions using software libraries like JAX, Huggingface Transformers, and Flax, but no underlying hardware specifications. |
| Software Dependencies | No | The paper mentions key software components such as JAX [Bradbury et al., 2018], the Huggingface Transformers library [Wolf et al., 2020], and Flax [Heek et al., 2020]. While these are cited, the specific version numbers (e.g., JAX vX.Y.Z) needed for full reproducibility are not provided. |
| Experiment Setup | Yes | For each task, we train a teacher model with AdamW [Loshchilov and Hutter, 2019] but, as explained in Section 3, we use SGD for the student model (inner loop). We also use scalar mixing [Peters et al., 2018] to pool representations from different layers automatically... We use the Kullback-Leibler divergence as L_expl, and we set β = 5 for attention-based explainers and β = 0.2 for gradient-based explainers (since we found smaller values to be better). We set L_sim as the cross-entropy loss for classification tasks, and as the mean squared error loss for text regression. (A hedged sketch of this loss composition follows the table.) |
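
The experiment-setup row above describes how the student's training signal combines a task loss with an explanation-matching loss. The sketch below, written in JAX with optax, illustrates that composition under stated assumptions: the function names, learning rates, and the choice of optax are illustrative and do not come from the paper or the released code; only the loss structure (cross-entropy or MSE as L_sim, KL divergence as L_expl, β = 5 for attention-based explainers versus β = 0.2 for gradient-based ones, AdamW for the teacher and SGD for the student) follows the quoted setup.

```python
# Minimal sketch of the loss composition quoted in the "Experiment Setup" row.
# Names such as kl_divergence and scaffold_loss are illustrative, not from the
# authors' released code at https://github.com/coderpat/learning-scaffold.
import jax.numpy as jnp
import optax  # assumed optimizer library; the paper only names AdamW and SGD


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two explanation distributions (e.g., attention maps)."""
    p = p + eps
    q = q + eps
    return jnp.sum(p * (jnp.log(p) - jnp.log(q)), axis=-1)


def scaffold_loss(student_logits, labels, student_expl, teacher_expl, beta=5.0):
    """L = L_sim + beta * L_expl, following the quoted setup:
    cross-entropy as L_sim for classification (MSE would replace it for text
    regression), KL divergence as L_expl, beta = 5 for attention-based
    explainers and beta = 0.2 for gradient-based ones."""
    l_sim = optax.softmax_cross_entropy_with_integer_labels(
        student_logits, labels
    ).mean()
    l_expl = kl_divergence(teacher_expl, student_expl).mean()
    return l_sim + beta * l_expl


# Optimizers as described: AdamW for the teacher (outer loop), SGD for the
# student (inner loop). Learning rates here are placeholders, not paper values.
teacher_optimizer = optax.adamw(learning_rate=1e-4)
student_optimizer = optax.sgd(learning_rate=1e-2)
```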