On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets
Authors: Cheng-Han Chiang, Hung-yi Lee
AAAI 2022, pp. 10518-10525
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By fine-tuning the pre-trained models on the GLUE benchmark, we can learn how beneficial it is to transfer the knowledge from the model trained on the dataset possessing that specific trait. Our experiments show that the explicit dependencies in the sequences of the pre-training data are critical to the downstream performance. Our results also reveal that models achieve better downstream performance when pre-trained on a dataset with a longer range of implicit dependencies. |
| Researcher Affiliation | Academia | Cheng-Han Chiang, Hung-yi Lee; National Taiwan University, Taiwan; dcml0714@gmail.com, hungyilee@ntu.edu.tw |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of its source code. |
| Open Datasets | Yes | We adopt the GLUE (Wang et al. 2019; Socher et al. 2013; Dolan and Brockett 2005; Cer et al. 2017; Williams, Nangia, and Bowman 2018; Rajpurkar et al. 2016) benchmarks to evaluate the models pre-trained on different L1s. We pre-train a RoBERTa-medium using a subset of English Wikipedia. We pre-train a RoBERTa-medium using Kannada from the OSCAR dataset (Suárez, Romary, and Sagot 2020). The sentences used for computing the distribution of j here are from SQuAD (Rajpurkar et al. 2016). We also use the Quora Question Pairs (QQP) dataset (Iyer, Dandekar, and Csernai 2017). A hedged dataset-loading sketch follows the table. |
| Dataset Splits | No | The paper mentions using "the evaluation set" and "original GLUE training set" but does not specify explicit percentages or sample counts for training, validation, or test splits. While GLUE has standard splits, the paper does not explicitly state them for reproduction. |
| Hardware Specification | Yes | The whole process, from stage 1 to stage 3, takes three days on a single V100 GPU. |
| Software Dependencies | No | The paper mentions using RoBERTa (Liu et al. 2019) and Byte Pair Encoding (BPE) but does not specify version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | We use a specific set of hyperparameters and three different random seeds to fine-tune the model for each task. We report the average and standard deviation over different seeds of the results on the evaluation set. A minimal seed-averaging sketch follows the table. |
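
Since the paper releases no code, the following is only a hedged illustration of the setup described in the Open Datasets row: instantiating a small RoBERTa-style model and loading a public GLUE task with the Hugging Face `transformers` and `datasets` libraries. The "medium" layer and hidden sizes below are assumptions, not values stated in the excerpt above.

```python
# Minimal sketch, assuming a Hugging Face-based pipeline (not the authors' code).
from datasets import load_dataset
from transformers import RobertaConfig, RobertaForMaskedLM

# Assumed "medium" configuration; the paper excerpt does not specify these sizes.
config = RobertaConfig(
    vocab_size=30_522,            # assumption: BPE vocabulary size
    hidden_size=512,              # assumption
    num_hidden_layers=8,          # assumption
    num_attention_heads=8,        # assumption
    intermediate_size=2048,       # assumption
    max_position_embeddings=514,  # RoBERTa-style offset for padding index
)
model = RobertaForMaskedLM(config)  # model to be pre-trained from scratch

# GLUE tasks used for downstream evaluation are public; e.g. Quora Question Pairs.
qqp = load_dataset("glue", "qqp")
print(model.num_parameters(), qqp)
```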
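
The Experiment Setup row reports the average and standard deviation over three fine-tuning seeds. Below is a self-contained sketch of that aggregation step only; the scores passed in are made-up placeholders, not results from the paper.

```python
# Minimal sketch of averaging per-seed evaluation scores, as described in the table.
import statistics

def summarize_over_seeds(task_name: str, scores: list[float]) -> str:
    """Aggregate per-seed evaluation scores into a mean ± standard deviation string."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{task_name}: {mean:.2f} ± {std:.2f} over {len(scores)} seeds"

# Illustrative placeholder numbers; real values come from the fine-tuning runs.
print(summarize_over_seeds("QQP", [88.1, 87.6, 88.4]))
```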