CIFD: Controlled Information Flow to Enhance Knowledge Distillation

Authors: Yashas Malur Saidutta, Rakshith Sharma Srinivasa, Jaejin Cho, Ching-Hua Lee, Chouchang Yang, Yilin Shen, Hongxia Jin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose a novel framework called Controlled Information Flow for Knowledge Distillation (CIFD) consisting of two components. First, we propose a significantly smaller alternative to TAs, the Rate-Distortion Module (RDM), which uses the teacher's penultimate-layer embedding and an information-rate-constrained bottleneck layer to replace the Teacher Assistant model. RDMs are smaller and easier to train than TAs, especially in large data regimes, since they operate on the teacher embeddings and do not need to relearn low-level input feature extractors. Also, by varying the information rate across the bottleneck, RDMs can replace TAs of different sizes. Second, we propose the use of an Information Bottleneck Module in the student model, which is crucial for regularization in the presence of a large number of RDMs. We show comprehensive state-of-the-art results for the proposed method on large datasets such as ImageNet. Further, we show a significant improvement in distilling CLIP-like models over a large 12M image-text dataset, outperforming CLIP-specialized distillation methods across five zero-shot classification datasets and two zero-shot image-text retrieval datasets. (See the RDM sketch after the table.)
Researcher Affiliation | Industry | Yashas Malur Saidutta, Rakshith S. Srinivasa, Jaejin Cho, Ching-Hua Lee, Chouchang Yang, Yilin Shen, Hongxia Jin; Samsung Research America, Mountain View, CA. Emails: {ym.saidutta, r.srinivasa, jaejin.cho, chinghua.l}@samsung.com, {c.yang1, yilin.shen, hongxia.jin}@samsung.com
Pseudocode | No | The paper includes diagrams illustrating its framework and training schemes (e.g., Figure 2, Figure 6), but it does not contain formal pseudocode or algorithm blocks.
Open Source Code | No | "As our experiments are implemented based on open-source code and publicly available datasets, we have provided the necessary details in our paper for reproducing the results on top of the public code base and database, with the associated URLs provided." (However, the NeurIPS checklist explicitly states 'No' to providing their own open-access code.)
Open Datasets | Yes | Our experimental results are split into two sections, one dealing with supervised classification on the CIFAR-100 [45] and ImageNet (IN) [46] datasets, and another with CLIP-like models trained on the Conceptual Captions 12M dataset [47].
Dataset Splits | Yes | We split the training data in ImageNet into a training and validation set with ratio 0.95 : 0.05. (See the split sketch after the table.)
Hardware Specification | Yes | CLIP-like models take 6 A100 GPU days for training.
Software Dependencies | No | The paper mentions several software packages like Optuna, Ray Tune, ffcv, and OpenCLIP, but it does not provide specific version numbers for these or other key software components like Python or PyTorch.
Experiment Setup | Yes | We used the Optuna algorithm [59] along with the Asynchronous HyperBand Scheduler [60] in the Ray Tune package [61] for hyperparameter optimization. Using this package, we optimized the distillation temperature (τ), the learning rate of the optimizer (SGD), the learning-rate decay, the momentum of the optimizer, the dropout (this is the KD dropout proposed in [46]), and all the λs (λ_KL, λ_1, . . . , λ_5, λ_IBM) involved in (9). The weight decay was fixed to 10^-4. (See the hyperparameter-search sketch after the table.)
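The abstract describes the Rate-Distortion Module as a small head that consumes the teacher's penultimate-layer embedding and passes it through an information-rate-constrained bottleneck. Below is a minimal, hedged sketch of such a module; the class name, the Gaussian (variational) parameterization of the bottleneck, and the classifier head are illustrative assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of an RDM-style head: a rate-constrained stochastic bottleneck
# on top of the teacher's penultimate-layer embedding. The parameterization is
# an assumption (variational Gaussian bottleneck), not the paper's exact code.
import torch
import torch.nn as nn


class RateDistortionModule(nn.Module):
    def __init__(self, embed_dim: int, bottleneck_dim: int, num_classes: int):
        super().__init__()
        # Posterior parameters of the bottleneck q(z | teacher embedding).
        self.mu = nn.Linear(embed_dim, bottleneck_dim)
        self.log_var = nn.Linear(embed_dim, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)

    def forward(self, teacher_embedding: torch.Tensor):
        mu = self.mu(teacher_embedding)
        log_var = self.log_var(teacher_embedding)
        # Reparameterization trick: sample z from the bottleneck distribution.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        logits = self.classifier(z)
        # Rate term: KL(q(z|x) || N(0, I)). Weighting this term differently
        # varies the information rate across the bottleneck, which is the knob
        # the abstract says can stand in for TAs of different sizes.
        rate = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1).mean()
        return logits, rate
```

The module only needs the teacher embedding as input, which is consistent with the abstract's claim that RDMs avoid relearning low-level input feature extractors.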
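The 0.95 : 0.05 train/validation split reported in the "Dataset Splits" row could be reproduced along the following lines; the splitting utility (torch.utils.data.random_split) and the fixed seed are assumptions, not the authors' code.

```python
# Minimal sketch of a 0.95 : 0.05 split of the ImageNet training data.
# random_split and the fixed seed are illustrative assumptions.
import torch
from torch.utils.data import random_split


def split_train_val(train_dataset, val_fraction: float = 0.05, seed: int = 0):
    val_size = int(len(train_dataset) * val_fraction)
    train_size = len(train_dataset) - val_size
    return random_split(
        train_dataset,
        [train_size, val_size],
        generator=torch.Generator().manual_seed(seed),
    )
```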
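The "Experiment Setup" row reports Optuna with the Asynchronous HyperBand (ASHA) scheduler inside Ray Tune. The sketch below shows one way to wire that up; the search-space bounds, the metric name val_acc, the number of samples, and the placeholder training loop are assumptions. Only the list of tuned quantities and the fixed weight decay of 10^-4 come from the paper, and the reporting call may differ across Ray versions.

```python
# Hedged sketch of the reported hyperparameter search: Optuna search algorithm
# with the ASHA scheduler in Ray Tune. Bounds, metric name, and the training
# stub are illustrative assumptions.
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch


def train_student(config):
    # Placeholder for the actual CIFD training loop: build the SGD optimizer
    # from config and report validation accuracy each epoch.
    for epoch in range(10):
        val_acc = 0.0  # replace with real evaluation
        # Newer Ray versions accept a metrics dict; older ones use keyword
        # arguments (tune.report(val_acc=...)) or ray.train.report.
        tune.report({"val_acc": val_acc})


search_space = {
    "temperature": tune.uniform(1.0, 10.0),   # distillation temperature tau
    "lr": tune.loguniform(1e-3, 1e-1),        # SGD learning rate
    "lr_decay": tune.uniform(0.1, 0.9),
    "momentum": tune.uniform(0.8, 0.99),
    "kd_dropout": tune.uniform(0.0, 0.5),
    "lambda_kl": tune.loguniform(1e-2, 1e1),  # one entry per lambda in the loss
    "lambda_ibm": tune.loguniform(1e-2, 1e1),
    "weight_decay": 1e-4,                     # fixed, per the paper
}

tuner = tune.Tuner(
    train_student,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(),
        scheduler=ASHAScheduler(),
        metric="val_acc",
        mode="max",
        num_samples=50,
    ),
)
results = tuner.fit()
```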