Contrastive Representation Distillation
Authors: Yonglong Tian, Dilip Krishnan, Phillip Isola
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. |
| Researcher Affiliation | Collaboration | Yonglong Tian MIT CSAIL yonglong@mit.edu Dilip Krishnan Google Research dilipkay@google.com Phillip Isola MIT CSAIL phillipi@mit.edu |
| Pseudocode | No | The paper does not include a distinct pseudocode block or algorithm listing. It describes its methods using mathematical equations and descriptive text. |
| Open Source Code | Yes | Code: http://github.com/HobbitLong/RepDistiller. |
| Open Datasets | Yes | Datasets (1) CIFAR-100 (Krizhevsky & Hinton, 2009) contains 50K training images... (2) ImageNet (Deng et al., 2009) provides 1.2 million images from 1K classes for training... (3) STL-10 (Coates et al., 2011) consists of a training set of 5K labeled images... (4) Tiny ImageNet (Deng et al., 2009) has 200 classes, each with 500 training images... (5) NYU-Depth V2 (Silberman et al., 2012) consists of 1449 indoor images, each labeled with dense depth image and semantic map. |
| Dataset Splits | Yes | ImageNet (Deng et al., 2009) provides 1.2 million images from 1K classes for training and 50K for validation. |
| Hardware Specification | Yes | In practice, we did not notice a significant difference in training time on ImageNet (e.g., 1.75 epochs/hour vs. 1.67 epochs/hour on two Titan V GPUs). |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or any other software dependencies with their specific versions. |
| Experiment Setup | Yes | For CIFAR-100, we initialize the learning rate as 0.05, and decay it by 0.1 every 30 epochs after the first 150 epochs until the last epoch (240). For MobileNetV2, ShuffleNetV1 and ShuffleNetV2, we use a learning rate of 0.01... Batch size is 64 for CIFAR-100 or 256 for ImageNet. We have validated different N: 16, 64, 256, 1024, 4096, 16384. We varied τ between 0.02 and 0.3. All experiments but those on ImageNet use a temperature of 0.1. For ImageNet, we use τ = 0.07. (Hedged sketches of this setup follow the table.) |
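
The paper states its contrastive objective through equations rather than pseudocode (see the Pseudocode row), parameterized by a temperature τ and a number of negatives N. The snippet below is a minimal, simplified stand-in for illustration only: an in-batch InfoNCE-style loss between student and teacher embeddings, not the authors' exact NCE critic with a memory buffer of N negatives. The linear embedding heads and the 128-dimensional shared space are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveDistillSketch(nn.Module):
    """Simplified InfoNCE-style stand-in for a contrastive distillation loss.

    The paper's objective uses an NCE critic with N buffered negatives;
    here negatives come only from the current mini-batch.
    """

    def __init__(self, s_dim, t_dim, feat_dim=128, temperature=0.1):
        super().__init__()
        # Linear heads map student/teacher features into a shared embedding space.
        self.embed_s = nn.Linear(s_dim, feat_dim)
        self.embed_t = nn.Linear(t_dim, feat_dim)
        # tau: 0.1 in the quoted CIFAR-100 setup, 0.07 for ImageNet.
        self.temperature = temperature

    def forward(self, f_s, f_t):
        # L2-normalize so the dot product is a cosine similarity.
        z_s = F.normalize(self.embed_s(f_s), dim=1)
        z_t = F.normalize(self.embed_t(f_t), dim=1)
        # Diagonal entries are positives (same input through student and teacher);
        # off-diagonal entries serve as negatives.
        logits = z_s @ z_t.t() / self.temperature
        targets = torch.arange(z_s.size(0), device=z_s.device)
        return F.cross_entropy(logits, targets)
```

With a memory buffer, the N negatives validated in the paper (16 up to 16384) would replace the in-batch negatives used in this sketch.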
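
Assuming a standard PyTorch training loop, the quoted CIFAR-100 schedule (initial learning rate 0.05, decayed by 0.1 every 30 epochs after epoch 150, 240 epochs, batch size 64) could be expressed roughly as below. The student model is a placeholder, and the SGD momentum and weight decay values are assumptions not given in this summary.

```python
import torch

# Placeholder student network; in practice this is the distilled student model.
model = torch.nn.Linear(512, 100)

# Quoted values: lr 0.05 (0.01 for MobileNetV2/ShuffleNet students),
# batch size 64 on CIFAR-100 (256 on ImageNet), 240 epochs.
# momentum and weight_decay below are assumed, not quoted.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=5e-4)

# Decay by 0.1 every 30 epochs after the first 150 epochs: 150, 180, 210.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 180, 210], gamma=0.1)

for epoch in range(240):
    # ... one training epoch over CIFAR-100 with batch size 64 ...
    scheduler.step()
```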
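
For the public datasets listed in the Open Datasets row, a minimal torchvision loading sketch is shown below; the root path and augmentation transforms are placeholders rather than the paper's exact pipeline.

```python
import torchvision
import torchvision.transforms as T

# Placeholder CIFAR-100 augmentation; not necessarily the paper's exact transforms.
cifar_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# CIFAR-100: 50K training images.
cifar100_train = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=cifar_transform)

# STL-10: 5K labeled training images.
stl10_train = torchvision.datasets.STL10(
    root="./data", split="train", download=True, transform=T.ToTensor())
```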