LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Authors: Yi-Lin Sung, Jaemin Cho, Mohit Bansal

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method with various models (T5 and CLIP-T5) on both natural language processing (GLUE) and vision-and-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory costs to fine-tune the whole network, while other methods only save 26% of that in similar parameter usages (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime.
Researcher Affiliation | Academia | Yi-Lin Sung, Jaemin Cho, Mohit Bansal, UNC Chapel Hill, {ylsung, jmincho, mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Our code is available at: https://github.com/ylsung/Ladder-Side-Tuning.
Open Datasets | Yes | For NLP tasks, we use the GLUE [55] benchmark, which consists of seven classification tasks and one regression task. For VL tasks, we experiment with visual question answering (VQA [16], GQA [25]), visual reasoning (NLVR2 [50]), and image captioning (MSCOCO [6]) tasks.
Dataset Splits | Yes | Since there is no local test set, we split 1k samples from the training set as the new validation set and use the original validation set as the test set. For datasets with fewer than 10k samples (RTE, MRPC, STS-B, CoLA), we split the validation set into two equal-sized subsets and treat them as the new validation and test sets. For MNLI, we use the mismatched set as the validation set and the matched set as the test set. (A sketch of this split scheme is given after the table.)
Hardware Specification | Yes | The experiments on T5 take around 12 hours to train with one A6000 GPU (48GB).
Software Dependencies | No | The paper mentions models such as T5 and CLIP-T5 but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | We search for learning rates over {3×10⁻⁴, 1×10⁻³, 3×10⁻³} for LST and LoRA [24]... The reduction factor used in LST is set to 8 if not additionally specified. We train every approach for 10 epochs on large datasets and 20 epochs on small ones... (A hyperparameter sketch is given after the table.)
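
The split scheme quoted in the Dataset Splits row can be summarized in a short sketch. This is an illustrative reconstruction assuming HuggingFace datasets-style objects; the function name make_splits, the seed, and the lowercase task identifiers are assumptions, not taken from the authors' repository.

    # Illustrative reconstruction of the GLUE split scheme described above,
    # assuming HuggingFace `datasets` Dataset objects; names and the seed are
    # assumptions, not the authors' implementation.
    from datasets import load_dataset

    SMALL_TASKS = {"rte", "mrpc", "stsb", "cola"}  # GLUE tasks with < 10k training samples

    def make_splits(task: str, seed: int = 42):
        raw = load_dataset("glue", task)
        if task == "mnli":
            # MNLI: mismatched set as validation, matched set as test.
            return raw["train"], raw["validation_mismatched"], raw["validation_matched"]
        if task in SMALL_TASKS:
            # Small tasks: split the original validation set into two equal halves
            # and use them as the new validation and test sets.
            halves = raw["validation"].train_test_split(test_size=0.5, seed=seed)
            return raw["train"], halves["train"], halves["test"]
        # Large tasks: hold out 1k training samples as the new validation set
        # and use the original validation set as the test set.
        held_out = raw["train"].train_test_split(test_size=1000, seed=seed)
        return held_out["train"], held_out["test"], raw["validation"]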
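
The Experiment Setup row likewise reads as a small hyperparameter grid. The sketch below only restates the quoted values; the dictionary layout, the helper names, and the 10k-sample threshold separating small from large datasets (borrowed from the Dataset Splits row) are assumptions rather than details taken from the paper.

    # Summary of the reported hyperparameters as a searchable grid; the dict
    # layout and helper names are illustrative, and the 10k "small dataset"
    # threshold is an assumption borrowed from the Dataset Splits row.
    from itertools import product

    SEARCH_SPACE = {
        "learning_rate": [3e-4, 1e-3, 3e-3],  # searched for both LST and LoRA
        "reduction_factor": [8],              # LST default unless otherwise specified
    }

    def num_epochs(train_size: int, small_threshold: int = 10_000) -> int:
        """10 epochs on large datasets, 20 epochs on small ones."""
        return 20 if train_size < small_threshold else 10

    # Enumerate every configuration in the grid.
    configs = [dict(zip(SEARCH_SPACE, vals)) for vals in product(*SEARCH_SPACE.values())]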