Robust Active Distillation

Authors: Cenk Baykal, Khoa Trinh, Fotis Iliopoulos, Gaurav Menghani, Erik Vee

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical evaluations on popular benchmarks that demonstrate the improved distillation performance enabled by our work relative to that of state-of-the-art active learning and active distillation methods.
Researcher Affiliation | Industry | Cenk Baykal, Khoa Trinh, Fotis Iliopoulos, Gaurav Menghani, Erik Vee, Google Research, {baykalc,khoatrinh,fotisi,gmenghani,erikvee}@google.com
Pseudocode | Yes | Algorithm 1 ACTIVEDISTILLATION; Algorithm 2 DEPROUND (a generic dependent-rounding sketch is given after this table)
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code or a link to a code repository.
Open Datasets | Yes | We considered the CIFAR10/CIFAR100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), and ImageNet (Deng et al., 2009) data sets.
Dataset Splits | Yes | We considered the CIFAR10/CIFAR100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), and ImageNet (Deng et al., 2009) data sets. Unless otherwise specified, we use the Adam optimizer (Kingma & Ba, 2014) with a batch size of 128 and data set-specific learning rate schedules. We follow the active distillation setting shown in Alg. 1 with various configurations. We used a validation data set of size 1,000 for the CIFAR10, CIFAR100, and SVHN data sets, and a validation data set of size 10,000 for ImageNet, to estimate m.
Hardware Specification | Yes | We conduct our evaluations on 64 Cloud TPU v4s, each with two cores.
Software Dependencies | No | The paper states "We implemented all algorithms in Python and used the TensorFlow (Abadi et al., 2015) deep learning library" and mentions the Adam optimizer (Kingma & Ba, 2014), but it does not provide specific version numbers for Python or TensorFlow.
Experiment Setup | Yes | Unless otherwise specified, we use the Adam optimizer (Kingma & Ba, 2014) with a batch size of 128 and data set-specific learning rate schedules. We train the student model for 100 epochs using SGD with momentum (= 0.9), batch size 256, and the following learning rate schedule: for the first 5 epochs, we linearly increase the learning rate from 0 to 0.1; for the next 30 epochs we use a learning rate of 0.1; for the 30 epochs after that, 0.01; for the next 20, 0.001; and 0.0001 for the remaining epochs. We used the Adam optimizer (Kingma & Ba, 2014) with the default parameters except for the learning rate schedule, which was as follows: for a given number of epochs n_epochs ∈ {100, 200}, we used 1e-3 as the learning rate for the first (2/5)·n_epochs, then 1e-4 until (3/5)·n_epochs, 1e-5 until (4/5)·n_epochs, 1e-6 until (9/10)·n_epochs, and finally 5e-7 until the end. We rounded the epoch windows that determine the learning rate schedule to integral values whenever necessary.
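
The learning-rate schedules quoted above are described only in prose; the following is a minimal Python sketch of both, written as plain epoch-to-rate functions. The function names, the warmup interpretation (reaching 0.1 at the end of epoch 5), and the use of round() for the epoch windows are our own assumptions, not code from the paper.

```python
# Hedged sketch of the two learning-rate schedules quoted in the table above.
# Names and exact boundary handling are assumptions; the paper gives only prose.

def sgd_student_lr(epoch: int) -> float:
    """Student schedule: 100 epochs of SGD with momentum 0.9, batch size 256."""
    if epoch < 5:                 # linear warmup from 0 to 0.1 over the first 5 epochs
        return 0.1 * (epoch + 1) / 5
    if epoch < 35:                # next 30 epochs
        return 0.1
    if epoch < 65:                # next 30 epochs
        return 0.01
    if epoch < 85:                # next 20 epochs
        return 0.001
    return 0.0001                 # remaining epochs


def adam_lr(epoch: int, n_epochs: int = 100) -> float:
    """Adam schedule keyed to fractions of n_epochs (100 or 200), windows rounded."""
    if epoch < round(0.4 * n_epochs):   # first 2/5 of training
        return 1e-3
    if epoch < round(0.6 * n_epochs):   # until 3/5
        return 1e-4
    if epoch < round(0.8 * n_epochs):   # until 4/5
        return 1e-5
    if epoch < round(0.9 * n_epochs):   # until 9/10
        return 1e-6
    return 5e-7                         # until the end
```

Since the paper uses TensorFlow, either function could be wrapped in a standard Keras callback, e.g. tf.keras.callbacks.LearningRateScheduler(lambda epoch: adam_lr(epoch, n_epochs=200)).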
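
The table also notes pseudocode for Algorithm 2, DEPROUND. The paper's own pseudocode is not reproduced here; the sketch below is a generic dependent-rounding routine in its standard form (repeatedly shift probability mass between two fractional entries so that marginals are preserved and at least one entry becomes integral), offered only as a reference point. The function name depround and its interface are assumptions.

```python
import random

def depround(p, tol=1e-9):
    """Round probabilities p (summing to an integer k) to a 0/1 vector with
    exactly k ones while preserving each marginal, i.e. E[x_i] = p_i."""
    p = list(p)
    while True:
        # Entries that are still strictly fractional.
        frac = [i for i, v in enumerate(p) if tol < v < 1 - tol]
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        alpha = min(1 - p[i], p[j])
        beta = min(p[i], 1 - p[j])
        # Move mass between entries i and j; the two branches are chosen with
        # probabilities that keep E[p[i]] and E[p[j]] unchanged, and each step
        # drives at least one of the two entries to 0 or 1.
        if random.random() < beta / (alpha + beta):
            p[i] += alpha
            p[j] -= alpha
        else:
            p[i] -= beta
            p[j] += beta
    return [1 if v > 0.5 else 0 for v in p]
```

In an active-distillation loop of the kind outlined in Algorithm 1, such a routine would allow sampling exactly a budget's worth of unlabeled examples whose inclusion probabilities match a given soft scoring of the pool.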