On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets

Authors: Cheng-Han Chiang, Hung-yi Lee

AAAI 2022, pp. 10518-10525 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By fine-tuning the pre-trained models on the GLUE benchmark, we can learn how beneficial it is to transfer the knowledge from the model trained on the dataset possessing that specific trait. Our experiments show that the explicit dependencies in the sequences of the pre-training data are critical to the downstream performance. Our results also reveal that models achieve better downstream performance when pre-trained on a dataset with a longer range of implicit dependencies.
Researcher Affiliation | Academia | Cheng-Han Chiang, Hung-yi Lee; National Taiwan University, Taiwan; dcml0714@gmail.com, hungyilee@ntu.edu.tw
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of its source code.
Open Datasets | Yes | We adopt the GLUE (Wang et al. 2019; Socher et al. 2013; Dolan and Brockett 2005; Cer et al. 2017; Williams, Nangia, and Bowman 2018; Rajpurkar et al. 2016) benchmarks to evaluate the models pre-trained on different L1s. We pre-train a RoBERTa-medium using a subset of English Wikipedia. We pre-train a RoBERTa-medium using Kannada from the OSCAR dataset (Suárez, Romary, and Sagot 2020). The sentences used for computing the distribution of j here are from SQuAD (Rajpurkar et al. 2016). The Quora Question Pairs (QQP) (Iyer, Dandekar, and Csernai 2017). (See the data-loading sketch after this table.)
Dataset Splits | No | The paper mentions using "the evaluation set" and the "original GLUE training set" but does not specify explicit percentages or sample counts for training, validation, or test splits. While GLUE has standard splits, the paper does not explicitly state them for reproduction.
Hardware Specification | Yes | The whole process, from stage 1 to stage 3, takes three days on a single V100 GPU.
Software Dependencies | No | The paper mentions using RoBERTa (Liu et al. 2019) and Byte Pair Encoding (BPE) but does not specify version numbers for any software libraries, programming languages, or other dependencies.
Experiment Setup | Yes | We use a specific set of hyperparameters and three different random seeds to fine-tune the model for each task. We report the average and standard deviation over different seeds of the results on the evaluation set. (See the fine-tuning sketch after this table.)
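
The open-datasets row names publicly available corpora (GLUE tasks, a subset of English Wikipedia, and Kannada from OSCAR). The following is a minimal data-loading sketch, assuming the HuggingFace `datasets` library; the paper does not state its tooling, so the dataset identifiers, the 1% Wikipedia slice, and the choice of GLUE tasks shown here are illustrative assumptions, not the authors' procedure.

```python
# Minimal data-loading sketch (assumptions: HuggingFace `datasets`,
# illustrative identifiers and subset sizes; not the authors' pipeline).
from datasets import load_dataset

# GLUE tasks used for downstream evaluation (QQP is one of them).
mrpc = load_dataset("glue", "mrpc")
qqp = load_dataset("glue", "qqp")

# A small subset of English Wikipedia, standing in for the natural-language
# pre-training corpus; the Kannada portion of OSCAR (Suárez, Romary, and
# Sagot 2020) could be pulled analogously from its Hub mirror.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1%]")

print(len(mrpc["train"]), len(qqp["train"]), len(wiki))
```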
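
The experiment-setup row describes fine-tuning with one hyperparameter set and three random seeds, reporting the mean and standard deviation on the evaluation set. Below is a minimal sketch of that protocol, assuming the HuggingFace `transformers` Trainer, the MRPC task, and placeholder hyperparameters, seed values, and checkpoint path; since the paper releases no code, none of these choices are the authors'.

```python
# Multi-seed GLUE fine-tuning sketch (assumptions: HuggingFace `transformers`
# and `datasets`, MRPC task, placeholder hyperparameters and checkpoint path).
import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, set_seed)

def finetune_once(seed, model_path="path/to/pretrained-roberta-medium"):
    # `model_path` is a placeholder for a locally pre-trained checkpoint.
    set_seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

    raw = load_dataset("glue", "mrpc")
    def encode(batch):
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length", max_length=128)
    encoded = raw.map(encode, batched=True)

    args = TrainingArguments(output_dir=f"out/seed{seed}", seed=seed,
                             num_train_epochs=3, learning_rate=2e-5,
                             per_device_train_batch_size=32, report_to="none")

    def accuracy(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": (logits.argmax(-1) == labels).mean()}

    trainer = Trainer(model=model, args=args,
                      train_dataset=encoded["train"],
                      eval_dataset=encoded["validation"],
                      compute_metrics=accuracy)
    trainer.train()
    return trainer.evaluate()["eval_accuracy"]

# Three seeds, as described in the paper; the seed values are arbitrary.
scores = [finetune_once(s) for s in (13, 42, 87)]
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the mean and standard deviation over seeds matters here because results on the smaller GLUE tasks can vary noticeably across fine-tuning runs, which is why the paper reports both statistics.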