Insights into Pre-training via Simpler Synthetic Tasks

Authors: Yuhuai Wu, Felix Li, Percy S. Liang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we perform three experiments that iteratively simplify pre-training and show that the simplifications still retain much of its gains. First, building on prior work, we perform a systematic evaluation of three existing synthetic pre-training methods on six downstream tasks.
Researcher Affiliation | Collaboration | Yuhuai Wu (Stanford University, Google Research) yuhuai@cs.stanford.edu; Felix Li (UC Berkeley) fzli@berkeley.edu; Percy Liang (Stanford University) pliang@cs.stanford.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release the source code at https://github.com/felixzli/synthetic_pretraining.
Open Datasets | Yes | We fine-tuned synthetically pre-trained models on a diverse suite of downstream tasks: 1) Java to C# code translation (10K training examples) (Lu et al., 2021); 2) two semantic parsing benchmarks, MTOP (17K training examples) (Li et al., 2021) and WebQSP (2.7K training examples) (Yih et al., 2016)... 3) USPTO-50K retrosynthesis (40K training examples) (Liu et al., 2017)... 4) the reading comprehension benchmark SQuAD 1.1 (87K training examples) (Rajpurkar et al., 2016); and 5) the summarization benchmark CNNDM-10K, which is 10K training examples from the CNNDM (Krishna et al., 2021).
Dataset Splits | Yes | For synthetic pre-training, we use the same hyperparameters that the off-the-shelf language pre-trained T5-small was trained with: AdaFactor optimizer, batch size 128, sequence length 512, and inverse square root learning rate 1/√max(n, 10000) where n is the current training step. We evaluate token validation accuracy every 5000 training steps.
Hardware Specification | No | The paper mentions "Google TPU Research Cloud" for experimental support, but does not specify particular TPU versions (e.g., v2, v3, v4) or other specific hardware models (GPU/CPU).
Software Dependencies | No | The paper does not provide specific software dependency names with version numbers in the main text.
Experiment Setup | Yes | Training Details: For synthetic pre-training, we use the same hyperparameters that the off-the-shelf language pre-trained T5-small was trained with: AdaFactor optimizer, batch size 128, sequence length 512, and inverse square root learning rate 1/√max(n, 10000) where n is the current training step. (A minimal code sketch of this schedule follows the table.)
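
The inverse square root schedule quoted above can be written out explicitly. The snippet below is a minimal sketch, not the authors' released code: it assumes plain Python, uses illustrative constant names (BATCH_SIZE, SEQ_LEN, LR_FLOOR_STEP) that do not appear in the paper, and only reproduces the quoted settings (learning rate 1/√max(n, 10000), batch size 128, sequence length 512).

```python
import math

# Hyperparameters quoted in the report (T5-small pre-training settings).
BATCH_SIZE = 128        # batch size from the quoted setup
SEQ_LEN = 512           # sequence length from the quoted setup
LR_FLOOR_STEP = 10_000  # the 10000 constant inside max(n, 10000)

def inverse_sqrt_lr(step: int) -> float:
    """Inverse square root schedule: lr(n) = 1 / sqrt(max(n, 10000))."""
    return 1.0 / math.sqrt(max(step, LR_FLOOR_STEP))

if __name__ == "__main__":
    # Print the learning rate at a few representative training steps.
    for n in (0, 5_000, 10_000, 40_000, 160_000):
        print(f"step {n:>7}: lr = {inverse_sqrt_lr(n):.6f}")
```

The max(n, 10000) term simply holds the learning rate at 1/√10000 = 0.01 for the first 10k steps, after which it decays as 1/√n.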