BayesTune: Bayesian Sparse Deep Model Fine-tuning

Authors: Minyoung Kim, Timothy Hospedales

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Tested on popular NLP benchmarks as well as the VTAB vision tasks, our approach shows significant improvement over the state of the art (e.g., 1 percentage point higher than the best SOTA when fine-tuning RoBERTa for the GLUE and SuperGLUE benchmarks).
Researcher Affiliation | Collaboration | Minyoung Kim (Samsung AI Center Cambridge, UK; mikim21@gmail.com), Timothy Hospedales (Samsung AI Center Cambridge, UK and University of Edinburgh, UK; t.hospedales@ed.ac.uk)
Pseudocode | Yes | Our overall algorithm, dubbed BayesTune, is summarized as pseudocode in Alg. 1.
Open Source Code | Yes | The Python/PyTorch code to reproduce the results is available at https://github.com/SamsungLabs/BayesTune. Alternatively, https://github.com/minyoungkim21/BayesTune
Open Datasets | Yes | We test our BayesTune on two popular benchmark datasets from NLP and vision for the downstream fine-tuning tasks: (language) fine-tuning the pre-trained RoBERTa-base model [28] on the GLUE [37] and SuperGLUE [38] tasks; (vision) fine-tuning the ImageNet-22K [6] pre-trained ViT-B/16 model [9] on VTAB-1K [44] image classification/prediction tasks.
Dataset Splits | Yes | We follow the experimental settings from [10], in which the original development sets serve as test sets, and the validation sets are formed by holding out a random 10% of the training sets. Each dataset in VTAB-1K consists of 1K training examples, and we use the splits officially provided (train 80% and validation 20%). (A minimal split sketch follows the table below.)
Hardware Specification | Yes | We run all models on a single NVIDIA V100 GPU with 32GB memory.
Software Dependencies | No | The paper mentions "Python/PyTorch code" and the use of the "jiant framework" and the "Adam optimiser", but does not provide version numbers for these software dependencies, which are necessary for full reproducibility. (A version-logging sketch follows the table below.)
Experiment Setup | Yes | Throughout all experiments we use the Gamma prior parameters (α = 0.01, β = 100), the scale variables λ_i are all initialized to 0.0001, and the learning rate for λ is 0.01 without scheduling. Other task-specific implementation details can be found in the subsequent sections. In BayesTune, we use 10K warm-up steps, 2K burn-in steps, and thinning at every 100 steps for all tasks. The batch size is 16, and the learning rate for the model parameters is 10^-4 for Stage-1 and 10^-3 for Stage-2. The chosen hyperparameters are as follows (N̂, γ): (NLP) cola = (11, 10^-4), stsb = (12, 10^-4), mrpc = (12, 100), rte = (8, 10^-4), cb = (10, 10^-4), copa = (8, 10^-2), wsc = (10, 10^-4); (VTAB) cifar100 = (7, 10^-1), caltech101 = (9, 10^-2), dtd = (12, 100), flower102 = (12, 10^-2), pets = (12, 100), svhn = (10, 100), sun397 = (7, 10^-1), camelyon = (6, 100), eurosat = (7, 10^-1), resisc45 = (12, 10^-2), retinopathy = (7, 10^-2), clevr-count = (7, 10^-3), clevr-dist = (7, 10^-3), dmlab = (8, 100), kitti = (7, 100), dsprite-loc = (12, 10^-4), dsprite-ori = (12, 10^-3), snorb-azim = (7, 10^-2), snorb-ele = (6, 10^-1). (A Stage-1 sketch follows below.)
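
As referenced in the Dataset Splits row, the split protocol is simple enough to sketch. The snippet below is a minimal illustration, not the authors' code; the function name, seed, and list-based dataset representation are assumptions.

```python
import random

def hold_out_validation(train_examples, frac=0.10, seed=0):
    """Hold out a random fraction of the training set as validation.

    Mirrors the GLUE/SuperGLUE protocol quoted above, where the original
    development set then serves as the test set. The seed is an assumption.
    """
    rng = random.Random(seed)
    idx = list(range(len(train_examples)))
    rng.shuffle(idx)
    n_val = int(len(idx) * frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    train = [train_examples[i] for i in train_idx]
    val = [train_examples[i] for i in val_idx]
    return train, val

# VTAB-1K needs no manual hold-out: each task ships 1K training examples
# with an official 80% train / 20% validation split.
```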
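
Because no versions are pinned (Software Dependencies row), anyone reproducing the setup has to record their own environment. A minimal logging sketch follows; it only assumes that Python and PyTorch, the tools the paper names, are installed.

```python
import platform
import torch

# Log the versions actually used, since the paper does not pin them.
print("python :", platform.python_version())
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)  # CUDA toolkit this PyTorch build targets
print("gpu    :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu")
```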
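
The Stage-1 settings in the Experiment Setup row (Gamma(α = 0.01, β = 100) hyper-prior, λ_i initialized to 0.0001, λ learning rate 0.01, 10K warm-up / 2K burn-in / thinning every 100 steps) can be illustrated with a generic SGLD-style update of the scale variables. This is a sketch of the two-stage recipe as described, not the paper's Alg. 1: the Laplace form of the prior on the deviation from the pretrained weights and the positivity clamp are assumptions.

```python
import math
import torch

# Stage-1 hyperparameters quoted above. The prior form below (a Laplace prior on the
# deviation from the pretrained weights, scaled per parameter by lam) is an assumption,
# not necessarily the paper's exact model.
ALPHA, BETA = 0.01, 100.0      # Gamma(alpha, beta) hyper-prior on each scale lam_i
LAM_INIT = 1e-4                # initial value of every lam_i
LR_LAM = 0.01                  # learning rate for lam (no scheduling)
WARMUP, BURNIN, THIN = 10_000, 2_000, 100

def sgld_lambda_step(lam, theta, theta_pre, lr=LR_LAM):
    """One SGLD-style update of the per-parameter scales lam.

    Only the prior terms depend on lam:
      log p(theta | lam) = -log(2*lam) - |theta - theta_pre| / lam   (Laplace, assumed)
      log p(lam)         = (ALPHA - 1) * log(lam) - BETA * lam       (Gamma, up to const.)
    """
    delta = (theta - theta_pre).abs()
    grad = -1.0 / lam + delta / lam**2 + (ALPHA - 1.0) / lam - BETA
    noise = torch.randn_like(lam) * math.sqrt(lr)
    lam = lam + 0.5 * lr * grad + noise
    return lam.clamp_min(1e-8)  # keep scales positive (a practical guard, not from the paper)

# After WARMUP + BURNIN steps, samples of lam would be collected every THIN steps, and the
# parameters with the largest (averaged) lam selected for Stage-2 fine-tuning using the
# per-task (N-hat, gamma) values listed in the table.
```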