NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
Authors: Xingcheng Yao, Yanan Zheng, Xiaocong Yang, Zhilin Yang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude. |
| Researcher Affiliation | Collaboration | 1 Institute for Interdisciplinary Information Sciences, Tsinghua University; 2 Department of Computer Science and Technology, Tsinghua University; 3 School of Economics and Management, Tsinghua University; 4 Recurrent AI, Inc.; 5 Shanghai Qi Zhi Institute. Correspondence to: Zhilin Yang <zhiliny@tsinghua.edu.cn>. |
| Pseudocode | No | TLM consists of two steps as shown in Figure 2. 1. Retrieve data from a general corpus using task data as queries. 2. Train a model from scratch by jointly optimizing the task objective and the language modeling objective on the retrieved data and task data. (A minimal retrieval sketch is given after this table.) |
| Open Source Code | Yes | Our code, model checkpoints and datasets are publicly available at: https://github.com/yaoxingcheng/TLM |
| Open Datasets | Yes | High-resource tasks have more than 5K task data, including AGNews (Zhang et al., 2015), IMDB (Maas et al., 2011), RCT (Dernoncourt & Lee, 2017), and Helpfulness (McAuley et al., 2015), while low-resource tasks include ChemProt (Kringelum et al., 2016), ACL-ARC (Jurgens et al., 2018), SciERC (Luan et al., 2018), and HyperPartisan (Kiesel et al., 2019). |
| Dataset Splits | Yes | We report the average performance across three random seeds, together with the standard deviation. We follow Beltagy et al. (2019) and Gururangan et al. (2020) to report the test micro-F1 for ChemProt and RCT, and macro-F1 for the rest of the datasets. Results on the development set using different retrieval methods and different general corpora on each task. |
| Hardware Specification | No | This effectively reduces the cost from training on 1,000 GPUs for one day to training on 8 GPUs for 42 hours. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Training Details: For each experiment of TLM, while fixing the training scale hyper-parameters (i.e., training steps, batch size and sequence length), we perform a grid search over ρ1 and ρ2. We list the hyper-parameters used in Table B.1 in the Appendix. (A sketch of the joint objective and grid search follows the table.) |
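As a concrete illustration of step 1 of the two-step procedure quoted in the Pseudocode row (retrieving a task-specific subset of the general corpus with each task example as a query), the sketch below uses BM25 retrieval, which the paper reports using. The `rank_bm25` package, whitespace tokenization, and the `top_k` value are assumptions made for illustration only; the authors' released code at https://github.com/yaoxingcheng/TLM should be treated as authoritative.

```python
# Minimal sketch of TLM step 1: retrieve a task-specific subset of a
# general corpus using each task example as a BM25 query.
# Assumptions: the rank_bm25 package is installed and whitespace
# tokenization is good enough for illustration purposes.
from rank_bm25 import BM25Okapi


def retrieve_subset(general_corpus, task_texts, top_k=50):
    """Return the union of the top_k BM25 matches for every task example.

    top_k is an illustrative placeholder, not the value used in the paper.
    """
    tokenized_corpus = [doc.split() for doc in general_corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    retrieved = set()
    for query in task_texts:
        for doc in bm25.get_top_n(query.split(), general_corpus, n=top_k):
            retrieved.add(doc)
    return list(retrieved)
```

The retrieved subset is then combined with the labeled task data for step 2, the joint training from scratch described in the Pseudocode row.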
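For the Experiment Setup row, the sketch below shows one plausible reading of the joint objective and the grid search over ρ1 and ρ2: ρ1 scaling the language-modeling loss on the retrieved external data and ρ2 scaling the supervised task loss on the labeled task data. The exact placement of the weights, the grid values, and the `train_and_evaluate` helper are assumptions for illustration; the actual grids are listed in Table B.1 of the paper.

```python
# Minimal sketch of TLM step 2's joint objective and the grid search over
# the mixing weights rho1 and rho2 quoted in the Experiment Setup row.
# The weight placement follows one reading of the paper and should be
# checked against the released code; train_and_evaluate is a hypothetical
# stand-in for a full training run with fixed training-scale settings.
import itertools


def joint_loss(lm_loss_external, lm_loss_task, task_loss, rho1, rho2):
    # rho1 scales the LM loss on retrieved external data,
    # rho2 scales the supervised task loss on labeled task data.
    return rho1 * lm_loss_external + lm_loss_task + rho2 * task_loss


def grid_search(train_and_evaluate, rho1_grid=(1.0, 4.0), rho2_grid=(20.0, 100.0)):
    """Try every (rho1, rho2) pair; grid values here are placeholders."""
    best = None
    for rho1, rho2 in itertools.product(rho1_grid, rho2_grid):
        dev_score = train_and_evaluate(rho1=rho1, rho2=rho2)
        if best is None or dev_score > best[0]:
            best = (dev_score, rho1, rho2)
    return best  # (best dev score, rho1, rho2)
```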