Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference

Authors: Tao Lei, Junwen Bai, Siddhartha Brahma, Joshua Ainslie, Kenton Lee, Yanqi Zhou, Nan Du, Vincent Zhao, Yuexin Wu, Bo Li, Yu Zhang, Ming-Wei Chang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that the CODA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CODA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.
Researcher Affiliation | Industry | Correspondence: taoleics@gmail.com, junwen@google.com
Pseudocode | No | The paper describes the iterative updates for the soft top-k function in Appendix C using mathematical equations and prose, but it does not present them in a formal pseudocode or algorithm block (an illustrative soft top-k sketch is provided after this table).
Open Source Code | No | The paper references checkpoints from the T5 project (https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) but does not provide a direct link to the source code for the CODA methodology developed in this paper, nor an explicit statement about its release.
Open Datasets | Yes | We use the C4 corpus [Raffel et al., 2020] for pretraining text models. For speech models, we use the Libri-Light corpus [Kahn et al., 2020] for pretraining. Our vision Transformer models use the same data and training procedure in Pix2Struct [Lee et al., 2022]. Our finetuning datasets for text models include the MNLI [Williams et al., 2018], RTE [Dagan et al., 2005, Haim et al., 2006, Giampiccolo et al., 2007, Bentivogli et al., 2009], BoolQ [Clark et al., 2019], SQuAD [Rajpurkar et al., 2016] and XSum [Narayan et al., 2018] datasets. The speech models are evaluated on the speech recognition task using the LibriSpeech dataset [Panayotov et al., 2015]. Finally, we use the OCR-VQA [Mishra et al., 2019], DocVQA [Mathew et al., 2021], and Screen2Words [Wang et al., 2021] datasets for vision models.
Dataset Splits | Yes | We report accuracy on the development set on 3 tasks × 3 model sizes, and set the number of selected tokens k = n/r.
Hardware Specification | Yes | All models have been pre-trained using 128 or 256 TPUv3/TPUv4 chips. [...] We use a single TPUv4 chip and 128 sequences per batch.
Software Dependencies | Yes | For our text and vision experiments, we implement our models using JAX [Bradbury et al., 2018]. Specifically, our training and model modules are built on top of the T5X, Flax and Flaxformer frameworks [Roberts et al., 2022, Heek et al., 2020]. [...] For the speech experiments, we use TensorFlow [Abadi et al., 2015] and the Lingvo framework [Shen et al., 2019].
Experiment Setup | Yes | Table 8 lists the hyper-parameters used for fine-tuning, including the sequence length, learning rate, batch size and the number of fine-tuning steps used. For NLP datasets, we set the maximum input length and decoding length to the 98th percentile of lengths in the training set. For vision datasets, we set the input length following the suggested values in Pix2Struct. We also find that annealing the number of routed tokens k can achieve better finetuning results. Specifically, we decrease k linearly from the sequence length n down to the target value n/r using the first 10% to 20% of the finetuning steps (a sketch of such a schedule is given after this table).
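
As noted in the Pseudocode row, Appendix C of the paper describes its soft top-k solver only through equations and prose. Below is a minimal, generic sketch of a differentiable soft top-k in JAX, assuming a sigmoid-threshold relaxation solved by bisection; it is not the paper's exact iterative update from Appendix C, and the function name `soft_top_k`, the temperature `eps`, and the iteration count are illustrative choices.

```python
# Illustrative sketch only: a generic differentiable soft top-k in JAX.
# NOTE: this is NOT the exact iterative update described in Appendix C of the
# paper; it is a common bisection-based relaxation, shown here for intuition.
import jax
import jax.numpy as jnp

def soft_top_k(scores, k, eps=0.1, num_iters=30):
    """Return relaxed indicators lam in [0, 1] whose sum is approximately k.

    lam_i = sigmoid((scores_i - t) / eps), where the threshold t is found by
    bisection so that the relaxed indicators sum to k.
    """
    lo = jnp.min(scores) - 10.0 * eps  # threshold low enough to select all tokens
    hi = jnp.max(scores) + 10.0 * eps  # threshold high enough to select none

    def body(_, bounds):
        lo, hi = bounds
        t = 0.5 * (lo + hi)
        mass = jnp.sum(jax.nn.sigmoid((scores - t) / eps))
        # Too much mass selected -> raise the threshold; too little -> lower it.
        lo = jnp.where(mass > k, t, lo)
        hi = jnp.where(mass > k, hi, t)
        return lo, hi

    lo, hi = jax.lax.fori_loop(0, num_iters, body, (lo, hi))
    t = 0.5 * (lo + hi)
    return jax.nn.sigmoid((scores - t) / eps)

# Example: route k = 2 of 6 tokens based on router scores.
scores = jnp.array([0.3, 2.1, -0.5, 1.7, 0.0, 0.9])
lam = soft_top_k(scores, k=2)
print(lam, lam.sum())  # indicators close to 1 for the two highest-scoring tokens
```

Because the relaxation is built from sigmoids, gradients flow back to the router scores, which is what lets a learned router be trained end to end alongside the adapters.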
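
The Experiment Setup row mentions decreasing the number of routed tokens k linearly from the sequence length n down to n/r over the first 10% to 20% of fine-tuning steps. A hypothetical helper illustrating such a linear schedule is sketched below; the name `annealed_k`, the default `warmup_frac`, and the rounding are assumptions, not details taken from the paper.

```python
# Hypothetical helper (not from the paper): linearly anneal the number of
# routed tokens k from the full sequence length n down to the target n / r
# over the first `warmup_frac` of fine-tuning steps, then hold it constant.
def annealed_k(step, total_steps, n, r, warmup_frac=0.1):
    target = n // r                            # final number of routed tokens
    warmup_steps = max(1, int(warmup_frac * total_steps))
    frac = min(step / warmup_steps, 1.0)       # goes 0 -> 1 over the warmup window
    return int(round(n - frac * (n - target)))

# Example: n = 512 tokens, reduction factor r = 4, 10,000 fine-tuning steps.
# k starts at 512 and reaches the target of 128 after the first 1,000 steps.
print(annealed_k(0, 10_000, 512, 4), annealed_k(1_000, 10_000, 512, 4))
```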