Variational Information Bottleneck for Effective Low-Resource Fine-Tuning

Authors: Rabeeh Karimi Mahabadi, Yonatan Belinkov, James Henderson

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation on seven low-resource datasets in different tasks shows that our method significantly improves transfer learning in low-resource scenarios, surpassing prior work.
Researcher Affiliation | Academia | EPFL, Switzerland; Idiap Research Institute, Switzerland; Technion - Israel Institute of Technology. Contact: {rabeeh.karimi,james.henderson}@idiap.ch, belinkov@technion.ac.il
Pseudocode | No | The paper describes the approach textually and with a diagram (Figure 1), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of a bottleneck head in the paper's spirit is given after this table.)
Open Source Code | Yes | Our code is publicly available in https://github.com/rabeehk/vibert.
Open Datasets | Yes | For NLI, we experiment with two well-known NLI benchmarks, namely SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). For text classification, we evaluate on two sentiment analysis datasets, namely IMDB (Maas et al., 2011) and Yelp2013 (YELP) (Zhang et al., 2015). We additionally evaluate on three low-resource datasets in the GLUE benchmark (Wang et al., 2019): paraphrase detection using MRPC (Dolan & Brockett, 2005), semantic textual similarity using STS-B (Cer et al., 2017), and textual entailment using RTE (Dagan et al., 2006). (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | For the GLUE benchmark, SNLI, and Yelp, we evaluate on the standard validation and test splits. (Table 7 also explicitly lists 'Val.' counts for each dataset.)
Hardware Specification | Yes | We run all experiments on one GTX1080Ti GPU with 11 GB of RAM.
Software Dependencies | No | The paper mentions using a PyTorch implementation of BERT models by Wolf et al. (2019), but it does not specify exact version numbers for PyTorch or other libraries such as 'transformers' or 'numpy', which would be needed for fully reproducible software dependencies.
Experiment Setup | Yes | We use the default hyper-parameters of BERT, i.e., we use a sequence length of 128, with batch size 32. We use the stable variant of the Adam optimizer (Zhang et al., 2021; Mosbach et al., 2021) with the default learning rate of 2e-5 through all experiments. We do not use warm-up or weight decay. For VIBERT, we sweep β over {10^-4, 10^-5, 10^-6} and K over {144, 192, 288, 384}. (A configuration sketch follows the table.)
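
The 'Pseudocode' row notes that the paper has no algorithm block, but the variational information bottleneck head it describes can be sketched compactly. The sketch below is an illustration only: the class name, the standard normal prior, hidden size 768, K = 144, and beta = 1e-5 are assumptions made for the example, not the authors' released implementation (see their repository for that).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VIBHead(nn.Module):
        """Minimal sketch of a variational information bottleneck head.

        Maps a sentence embedding to a Gaussian over a K-dimensional code,
        samples it with the reparameterization trick, and classifies from the
        sample. Names, sizes, and the standard normal prior are illustrative
        assumptions, not the authors' code.
        """

        def __init__(self, hidden_size=768, k=144, num_labels=3):
            super().__init__()
            self.mu = nn.Linear(hidden_size, k)       # mean of q(z|x)
            self.logvar = nn.Linear(hidden_size, k)   # log-variance of q(z|x)
            self.classifier = nn.Linear(k, num_labels)

        def forward(self, sentence_embedding, labels=None, beta=1e-5):
            mu = self.mu(sentence_embedding)
            logvar = self.logvar(sentence_embedding)
            # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            logits = self.classifier(z)
            # KL divergence between q(z|x) and a standard normal prior.
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
            loss = None
            if labels is not None:
                loss = F.cross_entropy(logits, labels) + beta * kl
            return logits, loss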
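All datasets named in the 'Open Datasets' row are publicly distributed. A minimal way to fetch them, assuming the Hugging Face datasets library (the paper does not prescribe a loader, and the Hub identifiers below may not match the exact copies the authors used):

    from datasets import load_dataset

    # GLUE low-resource tasks cited in the paper.
    mrpc = load_dataset("glue", "mrpc")
    stsb = load_dataset("glue", "stsb")
    rte = load_dataset("glue", "rte")

    # NLI benchmarks.
    snli = load_dataset("snli")
    mnli = load_dataset("multi_nli")

    # Sentiment classification.
    imdb = load_dataset("imdb")
    # The paper's YELP is Yelp2013 (Zhang et al., 2015); the closest public
    # Hub identifier may differ from the authors' exact copy.
    yelp = load_dataset("yelp_review_full")

    print({name: d["train"].num_rows
           for name, d in {"mrpc": mrpc, "rte": rte, "snli": snli}.items()})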
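The 'Experiment Setup' row maps directly onto a small hyper-parameter configuration. Only the values quoted above come from the paper; the dictionary layout and field names below are illustrative assumptions.

    from itertools import product

    # Fixed fine-tuning settings quoted in the paper.
    base_config = {
        "max_seq_length": 128,
        "batch_size": 32,
        "learning_rate": 2e-5,   # default BERT learning rate, stable Adam variant
        "warmup_steps": 0,       # no warm-up
        "weight_decay": 0.0,     # no weight decay
    }

    # VIBERT-specific sweep reported in the paper.
    betas = [1e-4, 1e-5, 1e-6]
    bottleneck_sizes = [144, 192, 288, 384]

    sweep = [dict(base_config, beta=beta, k=k)
             for beta, k in product(betas, bottleneck_sizes)]
    print(f"{len(sweep)} configurations in the sweep")  # 12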