Revisiting Few-sample BERT Fine-tuning

Authors: Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, Yoav Artzi

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. We study these issues and their remedies through experiments on multiple common benchmarks, focusing on few-sample fine-tuning scenarios.
Researcher Affiliation | Collaboration | ASAPP Inc., Stanford University, Penn State University, Cornell University; tz58@stanford.edu, {fwu, kweinberger, yoav}@asapp.com, arzoo@psu.edu
Pseudocode | Yes | Algorithm 1: the ADAM pseudocode adapted from Kingma & Ba (2014), and provided for reference. (An illustrative Adam sketch is given below.)
Open Source Code | No | The paper mentions and links to third-party open-source libraries such as HuggingFace's Transformers, PyTorch's ADAM implementation, and NVIDIA's Apex, but it does not provide a statement or link releasing code for the experiments developed in this paper.
Open Datasets | Yes | We follow the data setup of previous studies (Lee et al., 2020; Phang et al., 2018; Dodge et al., 2020) to study few-sample fine-tuning using eight datasets from the GLUE benchmark (Wang et al., 2019b). RTE: Recognizing Textual Entailment (Bentivogli et al., 2009) is a binary entailment classification task. We use the GLUE version. (A dataset-loading sketch is given below.)
Dataset Splits | Yes | For RTE, MRPC, STS-B, and CoLA, we divide the original validation set in half, using one half for validation and the other for test. For the other four larger datasets, we only study the downsampled versions, and split an additional 1k samples from the training set as our validation data and test on the original validation set. (A split sketch is given below.)
Hardware Specification | No | The paper mentions using "mixed precision training using Apex", which implies GPU usage, but it does not specify any particular hardware components such as GPU models, CPU types, or memory specifications.
Software Dependencies | Yes | We use the PyTorch ADAM implementation https://pytorch.org/docs/1.4.0/_modules/torch/optim/adamw.html. (See the fine-tuning sketch below.)
Experiment Setup | Yes | Unless noted otherwise, we follow the hyperparameter setup of Lee et al. (2020). We fine-tune the uncased, 24-layer BERT-Large model with batch size 32, dropout 0.1, and peak learning rate 2e-5 for three epochs. We clip the gradients to have a maximum norm of 1. We apply linear learning rate warm-up during the first 10% of the updates followed by a linear decay. We evaluate ten times on the validation set during training and perform early stopping. We fine-tune with 20 random seeds to compare different settings. (A configuration sketch is given below.)
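
The Pseudocode row points to Algorithm 1, the Adam update of Kingma & Ba (2014). For reference, here is a minimal, self-contained sketch of that update including the bias-correction step the paper focuses on; it is an illustration written for this report, not the paper's code, and the toy quadratic objective is an assumption for the demo.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update (Kingma & Ba, 2014), including bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction: the step that
    v_hat = v / (1 - beta2 ** t)                # the original BERTAdam omits
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

# Toy usage on a quadratic objective f(w) = ||w||^2 (assumed for the demo)
w = torch.ones(3)
m, v = torch.zeros(3), torch.zeros(3)
for t in range(1, 101):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t)
```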
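The Open Datasets row quotes the GLUE setup. Purely as an illustration of how those public datasets can be fetched, the sketch below uses the HuggingFace `datasets` library; the paper does not say which loader the authors used, so the library and split names here are assumptions.

```python
# Assumes the HuggingFace `datasets` package (pip install datasets);
# not necessarily the tooling the authors used.
from datasets import load_dataset

# RTE from the GLUE benchmark (Wang et al., 2019b), GLUE version as in the paper
rte = load_dataset("glue", "rte")
print(rte)              # DatasetDict with train / validation / test splits
print(rte["train"][0])  # fields: sentence1, sentence2, label, idx
```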
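The Dataset Splits row describes two regimes: RTE, MRPC, STS-B, and CoLA halve the original validation set into new validation and test sets, while the four larger tasks hold out 1k training samples for validation and test on the original validation set. The sketch below mirrors that bookkeeping, assuming HuggingFace `datasets` objects and an arbitrary seed; the authors' exact shuffling and the downsampling of the larger training sets are not reproduced here.

```python
from datasets import load_dataset

SEED = 0  # assumption: the split seed is not specified in the row above

def small_task_splits(task):
    """RTE / MRPC / STS-B / CoLA: halve the original validation set into val/test."""
    ds = load_dataset("glue", task)
    halves = ds["validation"].train_test_split(test_size=0.5, seed=SEED)
    return ds["train"], halves["train"], halves["test"]       # train, val, test

def large_task_splits(task, n_val=1000):
    """Larger tasks: hold out 1k training samples for validation and
    test on the original validation set (training-set downsampling not shown)."""
    ds = load_dataset("glue", task)
    carved = ds["train"].train_test_split(test_size=n_val, seed=SEED)
    return carved["train"], carved["test"], ds["validation"]  # train, val, test

train, val, test = small_task_splits("rte")
```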
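The Software Dependencies and Experiment Setup rows together specify the optimizer (the PyTorch AdamW implementation, which keeps Adam's bias correction) and the hyperparameters: batch size 32, dropout 0.1, peak learning rate 2e-5, three epochs, gradient clipping at norm 1, linear warm-up over the first 10% of updates followed by linear decay, and 20 random seeds. The sketch below assembles that configuration with `torch` and `transformers`; the weight-decay value, the data loader, and the omission of the evaluation/early-stopping loop are assumptions or simplifications, not the authors' code.

```python
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

EPOCHS, PEAK_LR, MAX_GRAD_NORM = 3, 2e-5, 1.0

def build_setup(train_loader, num_labels):
    # Uncased, 24-layer BERT-Large with dropout 0.1
    model = BertForSequenceClassification.from_pretrained(
        "bert-large-uncased", num_labels=num_labels,
        hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1)
    # PyTorch AdamW (bias-corrected); weight_decay=0.01 is an assumption
    optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)
    total_steps = EPOCHS * len(train_loader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),  # warm-up over first 10% of updates
        num_training_steps=total_steps)           # then linear decay
    return model, optimizer, scheduler

def finetune_one_seed(seed, train_loader, num_labels):
    torch.manual_seed(seed)  # repeated for each of the 20 random seeds
    model, optimizer, scheduler = build_setup(train_loader, num_labels)
    model.train()
    for _ in range(EPOCHS):
        for batch in train_loader:  # batches of 32, assumed set on the DataLoader
            loss = model(**batch).loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    # periodic validation and early stopping are omitted for brevity
    return model
```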