BayesTune: Bayesian Sparse Deep Model Fine-tuning
Authors: Minyoung Kim, Timothy Hospedales
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Tested on popular NLP benchmarks as well as the VTAB vision tasks, our approach shows significant improvement over the state-of-the-art (e.g., 1% point higher than the best SOTA when fine-tuning RoBERTa for the GLUE and SuperGLUE benchmarks). |
| Researcher Affiliation | Collaboration | Minyoung Kim¹ and Timothy Hospedales¹,². ¹Samsung AI Center Cambridge, UK (mikim21@gmail.com); ²University of Edinburgh, UK (t.hospedales@ed.ac.uk) |
| Pseudocode | Yes | Our overall algorithm, dubbed BayesTune, is summarized as pseudocode in Alg. 1. |
| Open Source Code | Yes | The Python/PyTorch code to reproduce the results is available at https://github.com/SamsungLabs/BayesTune. Alternatively, https://github.com/minyoungkim21/BayesTune |
| Open Datasets | Yes | We test our BayesTune on two popular benchmark datasets from NLP and vision for the downstream fine-tuning tasks: (language) fine-tuning the pre-trained RoBERTa-base model [28] on the GLUE [37] and SuperGLUE [38] tasks; (vision) fine-tuning the ImageNet-22K [6] pre-trained ViT-B/16 model [9] on VTAB-1K [44] image classification/prediction tasks. |
| Dataset Splits | Yes | We follow the experimental settings from [10], in which the original development sets serve as test sets, and the validation sets are formed by holding out a random 10% of the training sets. and Each dataset in VTAB-1K consists of 1K training examples, and we use the splits officially provided (train 80% and validation 20%). (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | We run all models on a single NVIDIA V100 GPU with 32GB memory. |
| Software Dependencies | No | The paper mentions "Python/PyTorch code" and the use of the "jiant framework" and "Adam optimiser", but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Throughout all experiments we use the Gamma prior parameters (α = 0.01, β = 100), the scale variables λ_i are all initialized to 0.0001, and the learning rate for λ is 0.01 without scheduling. Other task-specific implementation details can be found in the subsequent sections. and In BayesTune, we use 10K warm-up steps, 2K burn-in steps, and thinning at every 100 steps for all tasks. The batch size is 16, and the learning rate for the model parameters is 10⁻⁴ for Stage-1 and 10⁻³ for Stage-2. and The chosen hyperparameters are as follows (N̂, γ): (NLP) cola = (11, 10⁻⁴), stsb = (12, 10⁻⁴), mrpc = (12, 10⁰), rte = (8, 10⁻⁴), cb = (10, 10⁻⁴), copa = (8, 10⁻²), wsc = (10, 10⁻⁴); (VTAB) cifar100 = (7, 10⁻¹), caltech101 = (9, 10⁻²), dtd = (12, 10⁰), flower102 = (12, 10⁻²), pets = (12, 10⁰), svhn = (10, 10⁰), sun397 = (7, 10⁻¹), camelyon = (6, 10⁰), eurosat = (7, 10⁻¹), resisc45 = (12, 10⁻²), retinopathy = (7, 10⁻²), clevr-count = (7, 10⁻³), clevr-dist = (7, 10⁻³), dmlab = (8, 10⁰), kitti = (7, 10⁰), dsprite-loc = (12, 10⁻⁴), dsprite-ori = (12, 10⁻³), snorb-azim = (7, 10⁻²), snorb-ele = (6, 10⁻¹). (A minimal configuration sketch follows the table.) |
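
The Dataset Splits row above describes two simple protocols: for the GLUE/SuperGLUE tasks, a random 10% of each training set is held out as validation and the original development set serves as the test set; for VTAB-1K, the officially provided 80/20 train/validation split of the 1K examples is used. Below is a minimal sketch of the 90/10 holdout, assuming a generic map-style PyTorch `Dataset`; the function name and fixed seed are illustrative and not taken from the paper's jiant-based pipeline.

```python
import torch
from torch.utils.data import Dataset, random_split

def holdout_split(train_set: Dataset, val_fraction: float = 0.10, seed: int = 0):
    """Hold out a random fraction of the training set as a validation set,
    mirroring the 90/10 protocol described for the GLUE/SuperGLUE tasks."""
    n_val = int(len(train_set) * val_fraction)
    n_train = len(train_set) - n_val
    generator = torch.Generator().manual_seed(seed)  # fix the seed so the split is reproducible
    return random_split(train_set, [n_train, n_val], generator=generator)

# Usage with any map-style Dataset:
# train_subset, val_subset = holdout_split(full_train_set)
# The original development set is then kept aside as the test set.
```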
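
The Experiment Setup row above quotes the Stage-1 sampling schedule (10K warm-up steps, 2K burn-in steps, thinning every 100 steps), the Gamma prior (α = 0.01, β = 100), the λ initialization (10⁻⁴), and the λ learning rate (0.01). Below is a minimal, hypothetical sketch of how those values could be wired into an SGLD-style loop over the scale variables λ. It is not the authors' released implementation: the function names are invented, the data-likelihood part of the posterior gradient is replaced by the Gamma log-prior gradient alone, the positivity clamp is a guess, and treating warm-up plus burn-in as one initial skipped phase is an assumption.

```python
import torch

# Values quoted in the Experiment Setup row above.
ALPHA, BETA = 0.01, 100.0           # Gamma prior parameters on the scales lambda_i
LAMBDA_INIT = 1e-4                  # every scale variable starts at 0.0001
LR_LAMBDA = 1e-2                    # learning rate for lambda, no scheduling
LR_STAGE1, LR_STAGE2 = 1e-4, 1e-3   # model-parameter learning rates for the two stages
WARMUP_STEPS, BURNIN_STEPS, THIN_EVERY = 10_000, 2_000, 100
BATCH_SIZE = 16

def grad_log_prior(lmbda: torch.Tensor) -> torch.Tensor:
    """Gradient of the Gamma(ALPHA, BETA) log-density w.r.t. lambda.
    Used here as a stand-in for the full posterior gradient, whose
    data-likelihood term (computed on mini-batches of size 16) is omitted."""
    return (ALPHA - 1.0) / lmbda - BETA

def sgld_step(lmbda: torch.Tensor, lr: float = LR_LAMBDA) -> torch.Tensor:
    """One stochastic-gradient Langevin dynamics update: a gradient step on
    the (approximate) log-posterior plus Gaussian noise of scale sqrt(2*lr)."""
    noise = torch.randn_like(lmbda) * (2.0 * lr) ** 0.5
    lmbda = lmbda + lr * grad_log_prior(lmbda) + noise
    return lmbda.clamp_min(1e-8)    # keep scales positive; the paper's handling may differ

def sample_scales(num_params: int, num_kept_samples: int = 50) -> torch.Tensor:
    """Follow the quoted schedule: run the warm-up and burn-in steps, then keep
    every THIN_EVERY-th sample of lambda and return its posterior mean, a
    per-parameter score that could be thresholded to pick what to tune in Stage-2."""
    lmbda = torch.full((num_params,), LAMBDA_INIT)
    kept, step = [], 0
    while len(kept) < num_kept_samples:
        lmbda = sgld_step(lmbda)
        step += 1
        if step > WARMUP_STEPS + BURNIN_STEPS and step % THIN_EVERY == 0:
            kept.append(lmbda.clone())
    return torch.stack(kept).mean(dim=0)
```

Calling `sample_scales(num_params=...)` yields one λ estimate per model parameter; in the actual method the gradient would come from the mini-batch loss of the RoBERTa or ViT model rather than the prior alone, and Stage-2 would then fine-tune only the selected parameters.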