NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework

Authors: Xingcheng Yao, Yanan Zheng, Xiaocong Yang, Zhilin Yang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude.
Researcher Affiliation | Collaboration | 1 Institute for Interdisciplinary Information Sciences, Tsinghua University; 2 Department of Computer Science and Technology, Tsinghua University; 3 School of Economics and Management, Tsinghua University; 4 Recurrent AI, Inc.; 5 Shanghai Qi Zhi Institute. Correspondence to: Zhilin Yang <zhiliny@tsinghua.edu.cn>.
Pseudocode | No | TLM consists of two steps as shown in Figure 2: (1) retrieve data from a general corpus using task data as queries; (2) train a model from scratch by jointly optimizing the task objective and the language modeling objective on the retrieved data and the task data. (Minimal sketches of both steps follow the table.)
Open Source Code | Yes | Our code, model checkpoints and datasets are publicly available at: https://github.com/yaoxingcheng/TLM
Open Datasets | Yes | High-resource tasks have more than 5K task data points, including AGNews (Zhang et al., 2015), IMDB (Maas et al., 2011), RCT (Dernoncourt & Lee, 2017), and Helpfulness (McAuley et al., 2015), while low-resource tasks include ChemProt (Kringelum et al., 2016), ACL-ARC (Jurgens et al., 2018), SciERC (Luan et al., 2018), and HyperPartisan (Kiesel et al., 2019).
Dataset Splits | Yes | We report the average performance across three random seeds, together with the standard deviation. We follow Beltagy et al. (2019) and Gururangan et al. (2020) to report the test micro-F1 for ChemProt and RCT, and macro-F1 for the rest of the datasets. Results on the development set using different retrieval methods and different general corpora on each task. (A sketch of the metric computation follows the table.)
Hardware Specification | No | This effectively reduces the cost from training on 1,000 GPUs for one day to training on 8 GPUs for 42 hours.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Training details: for each experiment of TLM, while fixing the training-scale hyper-parameters (i.e., training steps, batch size and sequence length), we perform a grid search over ρ1 and ρ2. The hyper-parameters used are listed in Table B.1 in the Appendix. (A sketch of the joint objective weighted by ρ1 and ρ2 follows the table.)
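As a reading aid for the Pseudocode row, here is a minimal sketch of step 1 (retrieval), assuming BM25-style lexical retrieval via the rank_bm25 package with whitespace tokenization. The function name, corpus format, and top_k value are illustrative assumptions, not the authors' actual implementation.

from rank_bm25 import BM25Okapi

def retrieve_external_data(task_texts, general_corpus, top_k=50):
    # Build a BM25 index over the general corpus (whitespace tokenization for brevity).
    tokenized_corpus = [doc.lower().split() for doc in general_corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    retrieved = []
    for query in task_texts:
        # Use each task example as a query and keep its top_k matched documents.
        # top_k is a placeholder, not the value used in the paper.
        retrieved.extend(bm25.get_top_n(query.lower().split(), general_corpus, n=top_k))
    return retrieved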
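For step 2 and the Experiment Setup row, the following sketch shows one way the joint objective could combine the language-modeling terms and the task objective, with ρ1 weighting the retrieved-data LM term and ρ2 weighting the task-data LM term. The weight placement and the mlm_loss / task_loss helpers are assumptions for illustration, not the paper's exact formulation.

def tlm_joint_loss(model, external_batch, task_batch, rho1, rho2):
    # Masked-LM loss on retrieved external data (hypothetical helper method).
    loss_ext_mlm = model.mlm_loss(external_batch)
    # Masked-LM loss and supervised task loss on labeled task data (hypothetical helpers).
    loss_task_mlm = model.mlm_loss(task_batch)
    loss_task = model.task_loss(task_batch)
    # Joint objective: weighted LM terms plus the unweighted task objective.
    return rho1 * loss_ext_mlm + rho2 * loss_task_mlm + loss_task

The grid search described in the Experiment Setup row would then sweep rho1 and rho2 while holding the training-scale hyper-parameters (steps, batch size, sequence length) fixed.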
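For the Dataset Splits row, a sketch of the reporting protocol (micro-F1 for ChemProt and RCT, macro-F1 for the other tasks, mean and standard deviation over three random seeds), assuming scikit-learn's f1_score; the task-name keys and function name are illustrative.

import numpy as np
from sklearn.metrics import f1_score

MICRO_F1_TASKS = {"chemprot", "rct"}  # these tasks report micro-F1; the rest use macro-F1

def report_f1(task, labels_per_seed, preds_per_seed):
    # labels_per_seed / preds_per_seed: one array of test labels / predictions per seed.
    average = "micro" if task in MICRO_F1_TASKS else "macro"
    scores = [f1_score(y_true, y_pred, average=average)
              for y_true, y_pred in zip(labels_per_seed, preds_per_seed)]
    return float(np.mean(scores)), float(np.std(scores))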