Insights into Pre-training via Simpler Synthetic Tasks
Authors: Yuhuai Wu, Felix Li, Percy S. Liang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we perform three experiments that iteratively simplify pre-training and show that the simplifications still retain much of its gains. First, building on prior work, we perform a systematic evaluation of three existing synthetic pre-training methods on six downstream tasks. |
| Researcher Affiliation | Collaboration | Yuhuai Wu (1,2) yuhuai@cs.stanford.edu; Felix Li (3) fzli@berkeley.edu; Percy Liang (1) pliang@cs.stanford.edu. 1 Stanford University, 2 Google Research, 3 UC Berkeley |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the source code at https://github.com/felixzli/synthetic_pretraining. |
| Open Datasets | Yes | We fine-tuned synthetically pre-trained models on a diverse suite of downstream tasks: 1) Java to C# code translation (10K training examples) (Lu et al., 2021); 2) two semantic parsing benchmarks, MTOP (17K training examples) (Li et al., 2021) and WebQSP (2.7K training examples) (Yih et al., 2016)... 3) USPTO-50K retrosynthesis (40K training examples) (Liu et al., 2017)... 4) the reading comprehension benchmark SQuAD 1.1 (87K training examples) (Rajpurkar et al., 2016); and 5) the summarization benchmark CNNDM-10K, which is 10K training examples from CNNDM (Krishna et al., 2021). |
| Dataset Splits | Yes | For synthetic pre-training, we use the same hyperparameters that the off-the-shelf language pre-trained T5-small was trained with: Adafactor optimizer, batch size 128, sequence length 512, and inverse square root learning rate 1/sqrt(max(n, 10000)) where n is the current training step. We evaluate token validation accuracy every 5000 training steps. |
| Hardware Specification | No | The paper mentions "Google TPU Research Cloud" for experimental support, but does not specify particular TPU versions (e.g., v2, v3, v4) or other specific hardware models (GPU/CPU). |
| Software Dependencies | No | The paper does not provide specific software dependency names with version numbers in the main text. |
| Experiment Setup | Yes | Training Details: For synthetic pre-training, we use the same hyperparameters that the off-the-shelf language pre-trained T5-small was trained with: Adafactor optimizer, batch size 128, sequence length 512, and inverse square root learning rate 1/sqrt(max(n, 10000)) where n is the current training step. (A minimal sketch of this learning-rate schedule follows the table.) |
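
The inverse square-root rule quoted in the Dataset Splits and Experiment Setup rows is the one formula in the training setup, so a short sketch may help readers reconstruct it. This is a minimal illustration under our own assumptions: the function name, the constant name, and the standalone-script structure are ours, not identifiers from the authors' released code.

```python
# Minimal sketch of the inverse square-root learning-rate schedule quoted
# above: lr = 1 / sqrt(max(n, 10000)) at training step n.
# Names below are illustrative, not taken from the paper's repository.
import math

FLOOR_STEPS = 10_000  # the 10000 floor quoted in the training details


def inverse_sqrt_lr(step: int, floor_steps: int = FLOOR_STEPS) -> float:
    """Return the learning rate at a given training step.

    The rate is held at 1/sqrt(floor_steps) for the first `floor_steps`
    steps, then decays as 1/sqrt(step).
    """
    return 1.0 / math.sqrt(max(step, floor_steps))


if __name__ == "__main__":
    for n in (1, 5_000, 10_000, 40_000, 160_000):
        print(f"step {n:>7}: lr = {inverse_sqrt_lr(n):.6f}")
```

Under this reading, the rate stays at 1/sqrt(10000) = 0.01 for the first 10,000 steps and decays as 1/sqrt(n) afterwards, matching the off-the-shelf T5-small setting the paper says it reuses.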