Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference

Authors: Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, Yanqi Zhou, Nan Du, Vincent Zhao, Yuexin Wu, Bo Li, Yu Zhang, Ming-Wei Chang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the CODA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CODA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.
Researcher Affiliation | Industry | Correspondence: taoleics@gmail.com, junwen@google.com
Pseudocode | No | The paper describes the iterative updates for the soft top-k function in Appendix C using mathematical equations and prose, but it does not present them in a formal pseudocode or algorithm block (an illustrative soft top-k sketch is provided after this table).
Open Source Code | No | The paper references checkpoints from the T5 project (https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) but does not provide a direct link to the source code for the CODA methodology developed in this paper, nor an explicit statement about its release.
Open Datasets | Yes | We use the C4 corpus [Raffel et al., 2020] for pretraining text models. For speech models, we use the Libri-Light corpus [Kahn et al., 2020] for pretraining. Our vision Transformer models use the same data and training procedure in Pix2Struct [Lee et al., 2022]. Our finetuning datasets for text models include the MNLI [Williams et al., 2018], RTE [Dagan et al., 2005, Haim et al., 2006, Giampiccolo et al., 2007, Bentivogli et al., 2009], BoolQ [Clark et al., 2019], SQuAD [Rajpurkar et al., 2016] and XSum [Narayan et al., 2018] datasets. The speech models are evaluated on the speech recognition task using the LibriSpeech dataset [Panayotov et al., 2015]. Finally, we use the OCR-VQA [Mishra et al., 2019], DocVQA [Mathew et al., 2021], and Screen2Words [Wang et al., 2021] datasets for vision models.
Dataset Splits | Yes | We report accuracy on the development set on 3 tasks × 3 model sizes, and set the number of selected tokens k = n/r.
Hardware Specification | Yes | All models have been pre-trained using 128 or 256 TPUv3/TPUv4 chips. [...] We use a single TPUv4 chip and 128 sequences per batch.
Software Dependencies | Yes | For our text and vision experiments, we implement our models using JAX [Bradbury et al., 2018]. Specifically, our training and model modules are built on top of the T5X, Flax and Flaxformer frameworks [Roberts et al., 2022, Heek et al., 2020]. [...] For the speech experiments, we use TensorFlow [Abadi et al., 2015] and the Lingvo framework [Shen et al., 2019].
Experiment Setup | Yes | Table 8 lists the hyper-parameters used for fine-tuning, including the sequence length, learning rate, batch size and the number of fine-tuning steps used. For NLP datasets, we set the maximum input length and decoding length to the 98th percentile of lengths in the training set. For vision datasets, we set the input length following the suggested values in Pix2Struct. We also find that annealing the number of routed tokens k can achieve better finetuning results. Specifically, we decrease k linearly from the sequence length n down to the target value n/r using the first 10% to 20% of the finetuning steps (a sketch of such a schedule is given after this table).
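
As noted in the Pseudocode row, Appendix C of the paper describes its soft top-k solver only through equations and prose. Below is a minimal, generic sketch of a differentiable soft top-k in JAX, assuming a sigmoid-threshold relaxation solved by bisection; it is not the paper's exact iterative update from Appendix C, and the function name `soft_top_k`, the temperature `eps`, and the iteration count are illustrative choices.

```python
# Illustrative sketch only: a generic differentiable soft top-k in JAX.
# NOTE: this is NOT the exact iterative update described in Appendix C of the
# paper; it is a common bisection-based relaxation, shown here for intuition.
import jax
import jax.numpy as jnp

def soft_top_k(scores, k, eps=0.1, num_iters=30):
    """Return relaxed indicators lam in [0, 1] whose sum is approximately k.

    lam_i = sigmoid((scores_i - t) / eps), where the threshold t is found by
    bisection so that the relaxed indicators sum to k.
    """
    lo = jnp.min(scores) - 10.0 * eps  # threshold low enough to select all tokens
    hi = jnp.max(scores) + 10.0 * eps  # threshold high enough to select none

    def body(_, bounds):
        lo, hi = bounds
        t = 0.5 * (lo + hi)
        mass = jnp.sum(jax.nn.sigmoid((scores - t) / eps))
        # Too much mass selected -> raise the threshold; too little -> lower it.
        lo = jnp.where(mass > k, t, lo)
        hi = jnp.where(mass > k, hi, t)
        return lo, hi

    lo, hi = jax.lax.fori_loop(0, num_iters, body, (lo, hi))
    t = 0.5 * (lo + hi)
    return jax.nn.sigmoid((scores - t) / eps)

# Example: route k = 2 of 6 tokens based on router scores.
scores = jnp.array([0.3, 2.1, -0.5, 1.7, 0.0, 0.9])
lam = soft_top_k(scores, k=2)
print(lam, lam.sum())  # indicators close to 1 for the two highest-scoring tokens
```

Because the relaxation is built from sigmoids, gradients flow back to the router scores, which is what lets a learned router be trained end to end alongside the adapters.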
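
The Experiment Setup row mentions decreasing the number of routed tokens k linearly from the sequence length n down to n/r over the first 10% to 20% of fine-tuning steps. A hypothetical helper illustrating such a linear schedule is sketched below; the name `annealed_k`, the default `warmup_frac`, and the rounding are assumptions, not details taken from the paper.

```python
# Hypothetical helper (not from the paper): linearly anneal the number of
# routed tokens k from the full sequence length n down to the target n / r
# over the first `warmup_frac` of fine-tuning steps, then hold it constant.
def annealed_k(step, total_steps, n, r, warmup_frac=0.1):
    target = n // r                            # final number of routed tokens
    warmup_steps = max(1, int(warmup_frac * total_steps))
    frac = min(step / warmup_steps, 1.0)       # goes 0 -> 1 over the warmup window
    return int(round(n - frac * (n - target)))

# Example: n = 512 tokens, reduction factor r = 4, 10,000 fine-tuning steps.
# k starts at 512 and reaches the target of 128 after the first 1,000 steps.
print(annealed_k(0, 10_000, 512, 4), annealed_k(1_000, 10_000, 512, 4))
```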