Knowledge distillation via softmax regression representation learning
Authors: Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method is extremely simple to implement and straightforward to train and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains. |
| Researcher Affiliation | Collaboration | Jing Yang, University of Nottingham, Nottingham, UK, jing.yang2@nottingham.ac.uk; Brais Martinez, Samsung AI Center, Cambridge, UK, brais.mart@gmail.com; Adrian Bulat, Samsung AI Center, Cambridge, UK, adrian@adrianbulat.com; Georgios Tzimiropoulos, Samsung AI Center, Cambridge, UK and Queen Mary University of London, London, UK, g.tzimiropoulos@qmul.ac.uk |
| Pseudocode | Yes | Algorithm 1 Knowledge distillation via Softmax Regression Representation Learning |
| Open Source Code | Yes | The code is available at https://github.com/jingyang2017/KD_SRRL. |
| Open Datasets | Yes | CIFAR-10 is a popular image classification dataset consisting of 50,000 training and 10,000 testing images equally distributed across 10 classes. ... For CIFAR-100 (Krizhevsky & Hinton, 2009)... ImageNet-1K (Russakovsky et al., 2015). |
| Dataset Splits | No | The paper provides specific counts for training and testing images for CIFAR-10 (50,000 training and 10,000 testing), and mentions training and evaluation for ImageNet, but does not explicitly detail a separate validation split with specific counts or percentages. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using 'pretrained PyTorch models Paszke et al. (2017)', which indicates the use of PyTorch, but does not specify a version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | The ResNet models were trained for 350 epochs using SGD. The initial learning rate was set to 0.1, and then it was reduced by a factor of 10 at epochs 150, 250 and 320. Similarly, the WRN models were trained for 200 epochs with a learning rate of 0.1 that was subsequently reduced by a factor of 5 at epochs 60, 120 and 160. In all experiments, we set the dropout rate to 0. ... Batch size was set to 128. ... We used SGD with Nesterov momentum 0.9, weight decay 1e-4, initial learning rate 0.2, which was then dropped by a factor of 10 every 30 epochs, training in total for 100 epochs. |
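
For context on the Pseudocode row above: Algorithm 1 in the paper transforms the student's penultimate feature into the teacher's feature space, matches it to the teacher's feature, and additionally passes the transformed feature through the teacher's frozen classifier so that the resulting softmax output can be regressed onto the teacher's own prediction. Below is a minimal PyTorch sketch of these two loss terms; the names (`connector`, `teacher_classifier`, `srrl_losses`) are hypothetical, and the authors' reference implementation in the linked repository may differ in detail.

```python
import torch
import torch.nn.functional as F


def srrl_losses(student_feat, teacher_feat, teacher_classifier, connector, T=1.0):
    """Sketch of the two SRRL terms: feature matching (L_FM) and
    softmax regression (L_SR). `connector` is a small trainable module
    that maps student features to the teacher's feature dimensionality;
    `teacher_classifier` is the teacher's final linear layer, kept frozen."""
    # Map the student feature into the teacher's feature space.
    transformed = connector(student_feat)

    # L_FM: feature matching against the (detached) teacher feature.
    loss_fm = F.mse_loss(transformed, teacher_feat.detach())

    # L_SR: pass both features through the teacher's frozen classifier and
    # match the resulting class distributions.
    with torch.no_grad():
        teacher_logits = teacher_classifier(teacher_feat)
    student_logits = teacher_classifier(transformed)
    loss_sr = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    return loss_fm, loss_sr
```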
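
The ImageNet recipe quoted in the Experiment Setup row maps directly onto standard PyTorch optimizer and scheduler calls. A hedged sketch, using ResNet-18 (one of the paper's student architectures) as a stand-in and a placeholder training loop; this is not the authors' script.

```python
import torch
from torchvision.models import resnet18

# Student model as a stand-in; the paper distills into several architectures.
student = resnet18(num_classes=1000)

# SGD with Nesterov momentum 0.9, weight decay 1e-4, initial learning rate 0.2.
optimizer = torch.optim.SGD(
    student.parameters(),
    lr=0.2,
    momentum=0.9,
    nesterov=True,
    weight_decay=1e-4,
)

# Drop the learning rate by a factor of 10 every 30 epochs, 100 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # train_one_epoch(student, optimizer)  # placeholder for one pass over ImageNet
    scheduler.step()
```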