Variational Information Bottleneck for Effective Low-Resource Fine-Tuning

Authors: Rabeeh Karimi Mahabadi, Yonatan Belinkov, James Henderson

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation on seven low-resource datasets in different tasks shows that our method significantly improves transfer learning in low-resource scenarios, surpassing prior work.
Researcher Affiliation | Academia | EPFL, Switzerland; Idiap Research Institute, Switzerland; Technion - Israel Institute of Technology. Contact: {rabeeh.karimi,james.henderson}@idiap.ch, belinkov@technion.ac.il
Pseudocode | No | The paper describes the approach textually and with a diagram (Figure 1), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of a bottleneck head in the paper's spirit is given after this table.)
Open Source Code | Yes | Our code is publicly available in https://github.com/rabeehk/vibert.
Open Datasets | Yes | For NLI, we experiment with two well-known NLI benchmarks, namely SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). For text classification, we evaluate on two sentiment analysis datasets, namely IMDB (Maas et al., 2011) and Yelp2013 (YELP) (Zhang et al., 2015). We additionally evaluate on three low-resource datasets in the GLUE benchmark (Wang et al., 2019): paraphrase detection using MRPC (Dolan & Brockett, 2005), semantic textual similarity using STS-B (Cer et al., 2017), and textual entailment using RTE (Dagan et al., 2006). (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | For the GLUE benchmark, SNLI, and Yelp, we evaluate on the standard validation and test splits. (Table 7 also explicitly lists 'Val.' counts for each dataset.)
Hardware Specification | Yes | We run all experiments on one GTX1080Ti GPU with 11 GB of RAM.
Software Dependencies | No | The paper mentions using a PyTorch implementation of BERT models by Wolf et al. (2019), but it does not specify exact version numbers for PyTorch or other libraries such as 'transformers' or 'numpy', which would be needed for fully reproducible software dependencies.
Experiment Setup | Yes | We use the default hyper-parameters of BERT, i.e., we use a sequence length of 128, with batch size 32. We use the stable variant of the Adam optimizer (Zhang et al., 2021; Mosbach et al., 2021) with the default learning rate of 2e-5 through all experiments. We do not use warm-up or weight decay. For VIBERT, we sweep β over {10^-4, 10^-5, 10^-6} and K over {144, 192, 288, 384}. (A configuration sketch follows the table.)
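
The 'Pseudocode' row notes that the paper has no algorithm block, but the variational information bottleneck head it describes can be sketched compactly. The sketch below is an illustration only: the class name, the standard normal prior, hidden size 768, K = 144, and beta = 1e-5 are assumptions made for the example, not the authors' released implementation (see their repository for that).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VIBHead(nn.Module):
        """Minimal sketch of a variational information bottleneck head.

        Maps a sentence embedding to a Gaussian over a K-dimensional code,
        samples it with the reparameterization trick, and classifies from the
        sample. Names, sizes, and the standard normal prior are illustrative
        assumptions, not the authors' code.
        """

        def __init__(self, hidden_size=768, k=144, num_labels=3):
            super().__init__()
            self.mu = nn.Linear(hidden_size, k)       # mean of q(z|x)
            self.logvar = nn.Linear(hidden_size, k)   # log-variance of q(z|x)
            self.classifier = nn.Linear(k, num_labels)

        def forward(self, sentence_embedding, labels=None, beta=1e-5):
            mu = self.mu(sentence_embedding)
            logvar = self.logvar(sentence_embedding)
            # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            logits = self.classifier(z)
            # KL divergence between q(z|x) and a standard normal prior.
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
            loss = None
            if labels is not None:
                loss = F.cross_entropy(logits, labels) + beta * kl
            return logits, loss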
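All datasets named in the 'Open Datasets' row are publicly distributed. A minimal way to fetch them, assuming the Hugging Face datasets library (the paper does not prescribe a loader, and the Hub identifiers below may not match the exact copies the authors used):

    from datasets import load_dataset

    # GLUE low-resource tasks cited in the paper.
    mrpc = load_dataset("glue", "mrpc")
    stsb = load_dataset("glue", "stsb")
    rte = load_dataset("glue", "rte")

    # NLI benchmarks.
    snli = load_dataset("snli")
    mnli = load_dataset("multi_nli")

    # Sentiment classification.
    imdb = load_dataset("imdb")
    # The paper's YELP is Yelp2013 (Zhang et al., 2015); the closest public
    # Hub identifier may differ from the authors' exact copy.
    yelp = load_dataset("yelp_review_full")

    print({name: d["train"].num_rows
           for name, d in {"mrpc": mrpc, "rte": rte, "snli": snli}.items()})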
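The 'Experiment Setup' row maps directly onto a small hyper-parameter configuration. Only the values quoted above come from the paper; the dictionary layout and field names below are illustrative assumptions.

    from itertools import product

    # Fixed fine-tuning settings quoted in the paper.
    base_config = {
        "max_seq_length": 128,
        "batch_size": 32,
        "learning_rate": 2e-5,   # default BERT learning rate, stable Adam variant
        "warmup_steps": 0,       # no warm-up
        "weight_decay": 0.0,     # no weight decay
    }

    # VIBERT-specific sweep reported in the paper.
    betas = [1e-4, 1e-5, 1e-6]
    bottleneck_sizes = [144, 192, 288, 384]

    sweep = [dict(base_config, beta=beta, k=k)
             for beta, k in product(betas, bottleneck_sizes)]
    print(f"{len(sweep)} configurations in the sweep")  # 12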