Undistillable: Making A Nasty Teacher That CANNOT teach students
Authors: Haoyu Ma, Tianlong Chen, Ting-Kuei Hu, Chenyu You, Xiaohui Xie, Zhangyang Wang
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several datasets demonstrate that our method is effective on both standard KD and data-free KD, providing the desirable KD-immunity to model owners for the first time. Our codes and pre-trained models can be found at https://github.com/VITA-Group/Nasty-Teacher. |
| Researcher Affiliation | Academia | 1University of California, Irvine, 2University of Texas at Austin, 3Texas A&M University, 4Yale University {haoyum3,xhx}@uci.edu,{tianlong.chen, atlaswang}@utexas.edu, tkhu@tamu.edu, chenyu.you@yale.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes and pre-trained models can be found at https://github.com/VITA-Group/Nasty-Teacher. |
| Open Datasets | Yes | We explore the effectiveness of our nasty teachers on three representative datasets, i.e., CIFAR-10, CIFAR-100, and Tiny-ImageNet. |
| Dataset Splits | No | The paper mentions training epochs and learning rate schedules but does not explicitly provide information on validation dataset splits or percentages. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam and SGD but does not list specific software dependencies with version numbers (e.g., programming language, libraries, frameworks). |
| Experiment Setup | Yes | The distilling temperature τ_A for self-undermining training is set to 4 for CIFAR-10 and 20 for both CIFAR-100 and Tiny-ImageNet, as suggested in (Yuan et al., 2020). For the selection of ω, 0.004, 0.005, and 0.01 are picked for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively. The plain CNN is trained with a learning rate of 1e-3 for 100 epochs and optimized by the Adam optimizer (Kingma & Ba, 2014). Other networks are optimized by the SGD optimizer with momentum 0.9 and weight decay 5e-4. The learning rate is initialized as 0.1. Networks are trained for 160 epochs with the learning rate decayed by a factor of 10 at the 80th and 120th epochs for CIFAR-10, and for 200 epochs with the learning rate decayed by a factor of 5 at the 60th, 120th, and 160th epochs for CIFAR-100 and Tiny-ImageNet. |
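
The experiment-setup row above can be approximated with standard PyTorch components. The sketch below is hypothetical and not the authors' released code (see the linked repository for that): the optimizer, learning-rate milestones, τ_A, and ω values come directly from the quoted text, while the exact form of the self-undermining loss (cross-entropy minus an ω-weighted, temperature-softened KL term against a pretrained "adversarial" network) is an assumption about how those hyperparameters are combined.

```python
# Hypothetical reconstruction of the quoted training configuration.
# Not the authors' implementation; hyperparameters are taken from the paper,
# the loss form is an assumption based on "self-undermining training".
import torch
import torch.nn.functional as F

# Per-dataset hyperparameters quoted in the Experiment Setup row.
CONFIG = {
    "cifar10":       {"tau_a": 4,  "omega": 0.004, "epochs": 160, "milestones": [80, 120],      "gamma": 0.1},
    "cifar100":      {"tau_a": 20, "omega": 0.005, "epochs": 200, "milestones": [60, 120, 160], "gamma": 0.2},
    "tiny_imagenet": {"tau_a": 20, "omega": 0.01,  "epochs": 200, "milestones": [60, 120, 160], "gamma": 0.2},
}


def self_undermining_loss(logits_teacher, logits_adv, labels, tau_a, omega):
    """Assumed objective: keep the nasty teacher accurate (cross-entropy)
    while pushing its temperature-softened outputs away from a pretrained
    adversarial network (i.e., maximizing the KL term)."""
    ce = F.cross_entropy(logits_teacher, labels)
    kl = F.kl_div(
        F.log_softmax(logits_teacher / tau_a, dim=1),
        F.softmax(logits_adv / tau_a, dim=1),
        reduction="batchmean",
    ) * (tau_a ** 2)
    return ce - omega * kl  # minus sign: the KL divergence is maximized


def make_optimizer(model, dataset="cifar100"):
    """SGD with momentum 0.9, weight decay 5e-4, initial lr 0.1, and the
    quoted step schedule. The plain CNN baseline instead uses Adam with
    lr=1e-3 for 100 epochs (not shown here)."""
    cfg = CONFIG[dataset]
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=cfg["milestones"], gamma=cfg["gamma"])
    return optimizer, scheduler
```

Note that the decay factor of 5 in the paper corresponds to `gamma=0.2` in `MultiStepLR`, and the factor of 10 to `gamma=0.1`; the pretrained adversarial network supplying `logits_adv` is assumed to share the nasty teacher's architecture.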