Knowledge distillation via softmax regression representation learning

Authors: Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method is extremely simple to implement and straightforward to train and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains.
Researcher Affiliation | Collaboration | Jing Yang, University of Nottingham, Nottingham, UK, jing.yang2@nottingham.ac.uk; Brais Martinez, Samsung AI Center, Cambridge, UK, brais.mart@gmail.com; Adrian Bulat, Samsung AI Center, Cambridge, UK, adrian@adrianbulat.com; Georgios Tzimiropoulos, Samsung AI Center, Cambridge, UK and Queen Mary University of London, London, UK, g.tzimiropoulos@qmul.ac.uk
Pseudocode | Yes | Algorithm 1: Knowledge distillation via Softmax Regression Representation Learning (see the loss sketch after the table)
Open Source Code | Yes | The code is available at https://github.com/jingyang2017/KD_SRRL.
Open Datasets | Yes | CIFAR-10 is a popular image classification dataset consisting of 50,000 training and 10,000 testing images equally distributed across 10 classes. ... For CIFAR-100 (Krizhevsky & Hinton, 2009)... ImageNet-1K (Russakovsky et al., 2015).
Dataset Splits | No | The paper provides specific counts for training and testing images for CIFAR-10 (50,000 training and 10,000 testing) and mentions training and evaluation on ImageNet, but does not explicitly detail a separate validation split with specific counts or percentages.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing instance types used for running experiments.
Software Dependencies | No | The paper mentions using "pretrained PyTorch models (Paszke et al., 2017)", which indicates the use of PyTorch, but does not specify a version number for PyTorch or any other software dependency.
Experiment Setup | Yes | The ResNet models were trained for 350 epochs using SGD. The initial learning rate was set to 0.1, and then it was reduced by a factor of 10 at epochs 150, 250 and 320. Similarly, the WRN models were trained for 200 epochs with a learning rate of 0.1 that was subsequently reduced by a factor of 5 at epochs 60, 120 and 160. In all experiments, we set the dropout rate to 0. ... Batch size was set to 128. ... We used SGD with Nesterov momentum 0.9, weight decay 1e-4, initial learning rate 0.2 which was then dropped by a factor of 10 every 30 epochs, training in total for 100 epochs. (See the schedule sketch after the table.)
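To give a concrete picture of what Algorithm 1 (the Pseudocode row) amounts to, below is a minimal PyTorch-style sketch of an SRRL-style distillation loss: a feature-matching term on the penultimate features plus a term that passes the student's adapted feature through the teacher's frozen classifier (the "softmax regression") and matches the teacher's predictions. This is not the authors' implementation (that is in the linked repository); the 1x1 connector, the KL formulation of the softmax-regression term, and the weights `alpha`/`beta` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SRRLLoss(nn.Module):
    """Sketch of an SRRL-style distillation loss (not the official code)."""

    def __init__(self, student_dim, teacher_dim, teacher_classifier,
                 alpha=1.0, beta=1.0):
        super().__init__()
        # 1x1 connector mapping student features to the teacher's width
        # (an assumption about how the dimensionality gap is bridged).
        self.connector = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)
        self.teacher_classifier = teacher_classifier  # teacher's final FC layer
        for p in self.teacher_classifier.parameters():
            p.requires_grad = False  # the teacher head stays frozen
        self.alpha = alpha
        self.beta = beta

    def forward(self, feat_s, feat_t, logits_t):
        # feat_s, feat_t: penultimate feature maps of shape (N, C, H, W).
        feat_s_adapted = self.connector(feat_s)

        # (i) Feature-matching term: L2 between adapted student feature
        # and teacher feature.
        loss_fm = F.mse_loss(feat_s_adapted, feat_t)

        # (ii) Softmax-regression term: pool the student feature, pass it
        # through the teacher's classifier, and match the teacher's logits.
        pooled_s = F.adaptive_avg_pool2d(feat_s_adapted, 1).flatten(1)
        logits_s_via_t = self.teacher_classifier(pooled_s)
        loss_sr = F.kl_div(
            F.log_softmax(logits_s_via_t, dim=1),
            F.softmax(logits_t, dim=1),
            reduction="batchmean",
        )
        return self.alpha * loss_fm + self.beta * loss_sr


# Usage with illustrative shapes; real backbones would supply the
# penultimate feature maps and the teacher's logits.
teacher_fc = nn.Linear(2048, 1000)
criterion_kd = SRRLLoss(student_dim=512, teacher_dim=2048,
                        teacher_classifier=teacher_fc)
feat_s = torch.randn(8, 512, 7, 7)
feat_t = torch.randn(8, 2048, 7, 7)
logits_t = torch.randn(8, 1000)
loss = criterion_kd(feat_s, feat_t, logits_t)  # added to the usual CE loss
```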
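The Experiment Setup row quotes two training recipes. The sketch below wires the fully specified ImageNet recipe (SGD with Nesterov momentum 0.9, weight decay 1e-4, initial learning rate 0.2 dropped by 10x every 30 epochs, 100 epochs) and the CIFAR ResNet milestones into standard PyTorch schedulers. The placeholder `student` model and the choice of `StepLR`/`MultiStepLR` are assumptions; the CIFAR momentum and weight decay are not quoted in the row and are left at defaults.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR, MultiStepLR

# Placeholder student; the paper's students are ResNet / WRN architectures.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

# ImageNet recipe as quoted: Nesterov SGD, momentum 0.9, weight decay 1e-4,
# initial lr 0.2, dropped by a factor of 10 every 30 epochs, 100 epochs.
optimizer = SGD(student.parameters(), lr=0.2, momentum=0.9,
                nesterov=True, weight_decay=1e-4)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one training epoch over the loader would run here ...
    scheduler.step()

# CIFAR ResNet recipe as quoted: 350 epochs, initial lr 0.1, divided by 10
# at epochs 150, 250 and 320, batch size 128, dropout 0.
cifar_optimizer = SGD(student.parameters(), lr=0.1)  # momentum/decay not quoted
cifar_scheduler = MultiStepLR(cifar_optimizer,
                              milestones=[150, 250, 320], gamma=0.1)
```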