Revisiting Few-sample BERT Fine-tuning

Authors: Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, Yoav Artzi

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. We study these issues and their remedies through experiments on multiple common benchmarks, focusing on few-sample fine-tuning scenarios.
Researcher Affiliation | Collaboration | ASAPP Inc., Stanford University, Penn State University, Cornell University; tz58@stanford.edu, {fwu, kweinberger, yoav}@asapp.com, arzoo@psu.edu
Pseudocode | Yes | Algorithm 1: the ADAM pseudocode adapted from Kingma & Ba (2014), and provided for reference. (An illustrative Adam sketch is given below.)
Open Source Code | No | The paper mentions and links to third-party open-source libraries such as HuggingFace's Transformers, PyTorch's ADAM implementation, and NVIDIA's Apex, but it does not provide a statement or link releasing code for the experiments developed in this paper.
Open Datasets | Yes | We follow the data setup of previous studies (Lee et al., 2020; Phang et al., 2018; Dodge et al., 2020) to study few-sample fine-tuning using eight datasets from the GLUE benchmark (Wang et al., 2019b). RTE: Recognizing Textual Entailment (Bentivogli et al., 2009) is a binary entailment classification task. We use the GLUE version. (A dataset-loading sketch is given below.)
Dataset Splits | Yes | For RTE, MRPC, STS-B, and CoLA, we divide the original validation set in half, using one half for validation and the other for test. For the other four larger datasets, we only study the downsampled versions, and split an additional 1k samples from the training set as our validation data and test on the original validation set. (A split sketch is given below.)
Hardware Specification | No | The paper mentions using "mixed precision training using Apex", which implies GPU usage, but it does not specify any particular hardware components such as GPU models, CPU types, or memory specifications.
Software Dependencies | Yes | We use the PyTorch ADAM implementation https://pytorch.org/docs/1.4.0/_modules/torch/optim/adamw.html. (See the fine-tuning sketch below.)
Experiment Setup | Yes | Unless noted otherwise, we follow the hyperparameter setup of Lee et al. (2020). We fine-tune the uncased, 24-layer BERT-Large model with batch size 32, dropout 0.1, and peak learning rate 2e-5 for three epochs. We clip the gradients to have a maximum norm of 1. We apply linear learning rate warm-up during the first 10% of the updates followed by a linear decay. We evaluate ten times on the validation set during training and perform early stopping. We fine-tune with 20 random seeds to compare different settings. (A configuration sketch is given below.)
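
The Pseudocode row points to Algorithm 1, the Adam update of Kingma & Ba (2014). For reference, here is a minimal, self-contained sketch of that update including the bias-correction step the paper focuses on; it is an illustration written for this report, not the paper's code, and the toy quadratic objective is an assumption for the demo.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update (Kingma & Ba, 2014), including bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction: the step that
    v_hat = v / (1 - beta2 ** t)                # the original BERTAdam omits
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

# Toy usage on a quadratic objective f(w) = ||w||^2 (assumed for the demo)
w = torch.ones(3)
m, v = torch.zeros(3), torch.zeros(3)
for t in range(1, 101):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)
```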
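The Open Datasets row quotes the GLUE setup. Purely as an illustration of how those public datasets can be fetched, the sketch below uses the HuggingFace `datasets` library; the paper does not say which loader the authors used, so the library and split names here are assumptions.

```python
# Assumes the HuggingFace `datasets` package (pip install datasets);
# not necessarily the tooling the authors used.
from datasets import load_dataset

# RTE from the GLUE benchmark (Wang et al., 2019b), GLUE version as in the paper
rte = load_dataset("glue", "rte")
print(rte)              # DatasetDict with train / validation / test splits
print(rte["train"][0])  # fields: sentence1, sentence2, label, idx
```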
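The Dataset Splits row describes two regimes: RTE, MRPC, STS-B, and CoLA halve the original validation set into new validation and test sets, while the four larger tasks hold out 1k training samples for validation and test on the original validation set. The sketch below mirrors that bookkeeping, assuming HuggingFace `datasets` objects and an arbitrary seed; the authors' exact shuffling and the downsampling of the larger training sets are not reproduced here.

```python
from datasets import load_dataset

SEED = 0  # assumption: the split seed is not specified in the row above

def small_task_splits(task):
    """RTE / MRPC / STS-B / CoLA: halve the original validation set into val/test."""
    ds = load_dataset("glue", task)
    halves = ds["validation"].train_test_split(test_size=0.5, seed=SEED)
    return ds["train"], halves["train"], halves["test"]       # train, val, test

def large_task_splits(task, n_val=1000):
    """Larger tasks: hold out 1k training samples for validation and
    test on the original validation set (training-set downsampling not shown)."""
    ds = load_dataset("glue", task)
    carved = ds["train"].train_test_split(test_size=n_val, seed=SEED)
    return carved["train"], carved["test"], ds["validation"]  # train, val, test

train, val, test = small_task_splits("rte")
```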
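The Software Dependencies and Experiment Setup rows together specify the optimizer (the PyTorch AdamW implementation, which keeps Adam's bias correction) and the hyperparameters: batch size 32, dropout 0.1, peak learning rate 2e-5, three epochs, gradient clipping at norm 1, linear warm-up over the first 10% of updates followed by linear decay, and 20 random seeds. The sketch below assembles that configuration with `torch` and `transformers`; the weight-decay value, the data loader, and the omission of the evaluation/early-stopping loop are assumptions or simplifications, not the authors' code.

```python
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

EPOCHS, PEAK_LR, MAX_GRAD_NORM = 3, 2e-5, 1.0

def build_setup(train_loader, num_labels):
    # Uncased, 24-layer BERT-Large with dropout 0.1
    model = BertForSequenceClassification.from_pretrained(
        "bert-large-uncased", num_labels=num_labels,
        hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1)
    # PyTorch AdamW (bias-corrected); weight_decay=0.01 is an assumption
    optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)
    total_steps = EPOCHS * len(train_loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),  # warm-up over first 10% of updates
        num_training_steps=total_steps)           # then linear decay
    return model, optimizer, scheduler

def finetune_one_seed(seed, train_loader, num_labels):
    torch.manual_seed(seed)  # repeated for each of the 20 random seeds
    model, optimizer, scheduler = build_setup(train_loader, num_labels)
    model.train()
    for _ in range(EPOCHS):
        for batch in train_loader:  # batches of 32, assumed set on the DataLoader
            loss = model(**batch).loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    # periodic validation and early stopping are omitted for brevity
    return model
```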