Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Teacher’s pet: understanding and mitigating biases in distillation
Authors: Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on several image classification benchmarks show that these modifications of distillation maintain boost in overall accuracy, while additionally ensuring improvement in subgroup performance. ... We report results on the datasets used in §3: CIFAR-100, ImageNet; and long-tailed (LT) versions of the same. ... Table 3 summarises the results for all methods. |
| Researcher Affiliation | Industry | Michal Lukasik (Google Research), Srinadh Bhojanapalli (Google Research), Aditya Krishna Menon (Google Research), Sanjiv Kumar (Google Research) |
| Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release or a direct link to a source code repository. |
| Open Datasets | Yes | Experiments on several image classification benchmarks show that these modifications of distillation maintain boost in overall accuracy, while additionally ensuring improvement in subgroup performance. ... We train a ResNet-56 teacher on CIFAR-100-LT, a long-tailed version of CIFAR-100 (Cui et al., 2019; Cao et al., 2019) ... For ImageNet, we use the long-tailed version from Liu et al. (2019). ... We confirm this can indeed hold on the UCI Adult dataset using random forest models (details in Appendix C.3). |
| Dataset Splits | Yes | For the Ada-* methods, per §4, creating the label-dependent αy requires estimating the teacher's generalisation performance. To do this, we create a random holdout split of the training set. For non-LT datasets, we randomly split into 80% (new train) / 20% (dev). For LT datasets, for each class we hold out k examples into the dev set (k = 50 for Imagenet-LT, k = 20 for CIFAR-100-LT), or half of examples for a class if the total number of per-class examples is ≤ 2k. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using SGD and ResNet architectures, but it does not specify any software libraries, packages, or their version numbers that would be required to reproduce the experiments. |
| Experiment Setup | Yes | For all datasets, we train using SGD and weight decay 10⁻⁴ for CIFAR, and 0.5 × 10⁻⁴ for ImageNet datasets. ... CIFAR-100. We train for 450 epochs with an initial learning rate of 1.0, with a linear warmup in the first 15 epochs, and an annealed learning rate schedule. We drop the learning rate by a factor of 10 at epochs number: 200, 300 and 400. We use a mini-batch size of 1024. We use SGD with Nesterov momentum of 0.9. For our distillation experiments we train only with the cross-entropy objective against the teacher's logits. For each method we find the best temperature from the list of values: {1, 2, 3, 4, 5}. ImageNet. We train for 90 epochs with an initial learning rate of 0.8, with a linear warmup in the first 5 epochs, and an annealed learning rate schedule. We drop the learning rate by a factor of 10 at epochs number: 30, 60 and 80. We use a mini-batch size of 1024. For our distillation experiments we train with the distillation objective as defined in Equation 1 setting α = 0.2. For each method we fix the temperature to 0.9. Long-tail (LT) datasets. We follow the setup as in the non-long-tail version, except for the learning rate schedule, which we change to follow the cosine schedule (Loshchilov & Hutter, 2017). |
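The per-class holdout rule quoted under Dataset Splits (hold out k examples per class, or half the class when it has fewer than 2k examples) can be sketched as follows; the function and variable names are illustrative, not taken from the paper:

```python
import random
from collections import defaultdict

def long_tail_holdout(examples, k):
    """Split (x, label) pairs into train/dev for a long-tailed dataset.

    Per class, hold out k examples into the dev set (k = 50 for
    ImageNet-LT, k = 20 for CIFAR-100-LT in the quoted setup), or half
    of the class's examples if it has fewer than 2k in total.
    """
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append((x, y))
    train, dev = [], []
    for y, items in by_class.items():
        random.shuffle(items)
        # k examples if the class is large enough, otherwise half of it.
        n_dev = k if len(items) >= 2 * k else len(items) // 2
        dev.extend(items[:n_dev])
        train.extend(items[n_dev:])
    return train, dev
```

For non-LT datasets, the quoted 80%/20% split is a plain random partition of the training set instead.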
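The CIFAR-100 learning-rate recipe quoted above (linear warmup over 15 epochs, then drops by a factor of 10 at epochs 200, 300, and 400) can be sketched as a simple schedule function; this is a sketch of the quoted recipe, not code from the paper:

```python
def lr_schedule(epoch, base_lr=1.0, warmup=15, drops=(200, 300, 400), factor=0.1):
    """Linear warmup followed by step decay, per the quoted CIFAR-100 setup."""
    if epoch < warmup:
        # Ramp linearly from base_lr/warmup up to base_lr.
        return base_lr * (epoch + 1) / warmup
    lr = base_lr
    for d in drops:
        if epoch >= d:
            lr *= factor  # drop by a factor of 10 at each milestone
    return lr
```

For the LT datasets, the paper instead uses a cosine schedule (Loshchilov & Hutter, 2017), which would replace the step-decay branch.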