Churn Reduction via Distillation

Authors: Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh

ICLR 2022

Reproducibility assessment: each variable is listed with its result and the supporting LLM response.

Research Type: Experimental
LLM Response: "We test our method across a large number of datasets and neural network architectures." (Section 5, Experiments)

Researcher Affiliation: Industry
LLM Response: "Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh. Google Research. {heinrichj, hnarasimhan, dbahri, acotter, rostami}@google.com"

Pseudocode: Yes
LLM Response: "Algorithm 1: Distillation-based Churn Reduction." A hedged Keras sketch of this objective appears after the table.

Open Source Code: Yes
LLM Response: "Reproducibility Statement: All details of experimental setup are in the main text, along with descriptions of the baselines and what hyperparameters were swept across. Code can be found in the Appendix. All proofs are in the Appendix."

Open Datasets: Yes
LLM Response: "Datasets and architectures: The following are the datasets we use in our experiments, along with the associated model architectures: 12 OpenML datasets using fully-connected neural networks; 10 MNIST variants, SVHN, CIFAR10, and 40 CelebA tasks using convolutional networks; CIFAR10 and CIFAR100 with ResNet-50, ResNet-101, and ResNet-152; the IMDB dataset using a transformer network."

Dataset Splits: Yes
LLM Response: "For each dataset, we use the standard train/test split if available; otherwise, we fix a random train/test split with ratio 2:1. We randomly select from the training set 1000 initial examples, 100 validation examples, and a batch of 1000 examples."

Hardware Specification: Yes
LLM Response: "For each run, we used an NVIDIA V100 GPU, which took up to several days to finish all 100 trials."

Software Dependencies: No
LLM Response: The paper states that "Code for the models in Keras can be found in the Appendix," and imports such as tf.keras.Sequential are present, but specific version numbers for these software dependencies are not provided in the text.

Experiment Setup: Yes
LLM Response: "We train an initial model using the Adam optimizer with default settings on the initial set, with early stopping (i.e., stop when there is no improvement in the validation loss after 5 epochs) and default random initialization. For distillation, we tune the trade-off parameter λ across {0.1, 0.2, ..., 0.9}." A sketch of this training setup appears after the table.
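
For concreteness, the following is a minimal Keras sketch of the kind of objective referenced by Algorithm 1 (Distillation-based Churn Reduction): the label loss is mixed with a cross-entropy term that pulls the new model toward the old model's predictions, weighted by the trade-off parameter λ that the paper sweeps over {0.1, ..., 0.9}. This is an illustrative reconstruction under those assumptions, not the authors' released code; the architectures, data, and the exact form of the distillation term are placeholders.

```python
import numpy as np
import tensorflow as tf

def make_churn_reduction_loss(old_model, lam=0.5):
    """Mix the standard label loss with a distillation term toward the
    predictions of the previously trained model, weighted by lam."""
    ce = tf.keras.losses.CategoricalCrossentropy()

    def loss_fn(x, y_true, y_pred):
        # Teacher probabilities from the old model; no gradients flow into it.
        old_probs = tf.stop_gradient(old_model(x, training=False))
        return (1.0 - lam) * ce(y_true, y_pred) + lam * ce(old_probs, y_pred)

    return loss_fn

# Illustrative usage with placeholder models and data.
num_classes = 10

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

old_model = make_model()   # stands in for the already-trained initial model
new_model = make_model()   # the model being retrained on updated data

x = np.random.randn(256, 32).astype("float32")
y = tf.keras.utils.to_categorical(
    np.random.randint(num_classes, size=256), num_classes).astype("float32")

loss_fn = make_churn_reduction_loss(old_model, lam=0.3)  # lam swept over {0.1, ..., 0.9}
optimizer = tf.keras.optimizers.Adam()  # Adam with default settings

for step in range(5):
    with tf.GradientTape() as tape:
        y_pred = new_model(x, training=True)
        loss = loss_fn(x, y, y_pred)
    grads = tape.gradient(loss, new_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, new_model.trainable_variables))
```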
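
The reported baseline training setup (Adam with default settings, early stopping after 5 epochs without validation-loss improvement, 1000 initial and 100 validation examples) can be sketched in tf.keras roughly as follows; the network architecture, input dimensionality, and data below are placeholders rather than the paper's actual configuration.

```python
import numpy as np
import tensorflow as tf

num_classes = 10  # placeholder; depends on the dataset

# Placeholder data standing in for the splits described above:
# 1000 initial training examples and 100 validation examples.
x_init = np.random.randn(1000, 32).astype("float32")
y_init = np.random.randint(num_classes, size=1000)
x_val = np.random.randn(100, 32).astype("float32")
y_val = np.random.randint(num_classes, size=100)

# Placeholder fully-connected network; the paper varies the architecture
# by dataset (fully-connected, convolutional, ResNet, transformer).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(),  # default settings
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: stop when the validation loss has not improved for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

model.fit(x_init, y_init,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[early_stop],
          verbose=0)
```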