Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners

Authors: Seonghyeon Ye, Doyoung Kim, Joel Jang, Joongbo Shin, Minjoon Seo

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 14 tasks of the BIG-bench benchmark, the 11B-sized FLIPPED outperforms zero-shot T0-11B (Sanh et al., 2021) and even the 16-times-larger 3-shot GPT-3 (175B) (Brown et al., 2020) on average by 8.4% and 9.7% points, respectively. FLIPPED gives particularly large improvements on tasks with unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of FLIPPED comes from improved generalization to novel labels.
Researcher Affiliation | Collaboration | Seonghyeon Ye1, Doyoung Kim1, Joel Jang1, Joongbo Shin2, Minjoon Seo1; 1KAIST, 2LG AI Research; {seonghyeon.ye,ikevin98,joeljang,minjoon}@kaist.ac.kr, jb.shin@lgresearch.ai
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found.
Open Source Code | Yes | We release our code at github.com/seonghyeonye/Flipped-Learning.
Open Datasets | Yes | For meta-training, we utilize a subset of the T0 (Sanh et al., 2021) meta-training datasets: 4 task clusters (sentiment classification, paraphrase detection, topic classification, multi-choice QA), comprising 20 datasets in total. We use imdb (Maas et al., 2011), amazon_polarity (McAuley & Leskovec, 2013), rotten_tomatoes (Pang & Lee, 2005), yelp_review_full (Zhang et al., 2015b), and app_reviews for sentiment; glue/qqp (Wang et al., 2018), paws/labeled_final (Zhang et al., 2019), and glue/mrpc (Dolan & Brockett, 2005) for paraphrase; ag_news (Zhang et al., 2015a) and dbpedia_14 (Lehmann et al., 2015) for topic classification; and cos_e/v1.11 (Rajani et al., 2019), dream (Sun et al., 2019), quail (Rogers et al., 2020), quartz (Tafjord et al., 2019b), social_i_qa (Sap et al., 2019), wiqa (Tandon et al., 2019), cosmos_qa (Huang et al., 2019), qasc (Khot et al., 2020), quarel (Tafjord et al., 2019a), and sciq (Welbl et al., 2017) for multi-choice QA (see the dataset-loading sketch after this table).
Dataset Splits | No | The paper describes meta-training and evaluation (test) datasets but does not explicitly specify a separate validation split, with percentages or counts, used for hyperparameter tuning or early stopping during training.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments are provided in the paper.
Software Dependencies | No | The paper mentions using T5.1.1 (Raffel et al., 2019) and the LM-adapted T5 model (Lester et al., 2021) as backbone LMs, but does not provide version numbers for software dependencies such as the programming language, frameworks, or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We train each model for 5K steps with a batch size of 240. We set the input and output sequence lengths to 512 and 128, respectively, for FLIPPED-3B; for FLIPPED-11B we set them to 384 and 64 for computational efficiency. For DIRECT and CHANNEL we set the learning rate to 1e-4, and for FLIPPED to 5e-5, because the training objective differs (generation vs. denoising). We set the weight hyperparameter balancing the likelihood and unlikelihood losses to λ = 3 (see the training-configuration sketch after this table).
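
To make the meta-training mixture listed under Open Datasets concrete, here is a minimal Python sketch that loads the 20 datasets with the HuggingFace `datasets` library. The hub identifiers and the `load_meta_training_data` helper are assumptions based on the dataset names quoted from the paper, not code taken from the authors' release at github.com/seonghyeonye/Flipped-Learning.

```python
from datasets import load_dataset

# (path, config) pairs for the 20 T0 meta-training datasets quoted above;
# config is None when the dataset has a single configuration.
# These hub identifiers are assumptions, not the authors' exact loading code.
META_TRAIN = {
    "sentiment": [("imdb", None), ("amazon_polarity", None), ("rotten_tomatoes", None),
                  ("yelp_review_full", None), ("app_reviews", None)],
    "paraphrase": [("glue", "qqp"), ("paws", "labeled_final"), ("glue", "mrpc")],
    "topic_classification": [("ag_news", None), ("dbpedia_14", None)],
    "multi_choice_qa": [("cos_e", "v1.11"), ("dream", None), ("quail", None),
                        ("quartz", None), ("social_i_qa", None), ("wiqa", None),
                        ("cosmos_qa", None), ("qasc", None), ("quarel", None),
                        ("sciq", None)],
}

def load_meta_training_data(split="train"):
    """Load every dataset in the meta-training mixture for the given split."""
    return {
        cluster: [load_dataset(path, config, split=split) for path, config in specs]
        for cluster, specs in META_TRAIN.items()
    }
```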
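
For reference, the hyperparameters in the Experiment Setup row can be collected into a small configuration object, and the λ-weighted combination of likelihood and unlikelihood losses can be sketched in PyTorch. This is a minimal sketch assuming token-level cross-entropy; the names `TrainConfig` and `flipped_loss` are hypothetical, and the authors' exact formulation is in their released code.

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class TrainConfig:
    # Values copied from the Experiment Setup row; field names are illustrative.
    train_steps: int = 5_000
    batch_size: int = 240
    input_len: int = 512              # 384 for FLIPPED-11B
    output_len: int = 128             # 64 for FLIPPED-11B
    learning_rate: float = 5e-5       # 1e-4 for DIRECT / CHANNEL
    unlikelihood_weight: float = 3.0  # λ

def flipped_loss(logits_correct, labels_correct, logits_wrong, labels_wrong, lam=3.0):
    """Cross-entropy (likelihood) loss on the target tokens given the correct label,
    plus a λ-weighted unlikelihood loss that pushes down the probability of the same
    targets when conditioned on an incorrect label. Padding/masking is omitted."""
    vocab = logits_correct.size(-1)
    # Likelihood term: standard token-level cross-entropy.
    ll = F.cross_entropy(logits_correct.reshape(-1, vocab), labels_correct.reshape(-1))
    # Unlikelihood term: -log(1 - p) for each target token under the wrong-label input.
    log_probs = F.log_softmax(logits_wrong, dim=-1)
    p_wrong = log_probs.gather(-1, labels_wrong.unsqueeze(-1)).squeeze(-1).exp()
    ul = -torch.log1p(-p_wrong.clamp(max=1.0 - 1e-6)).mean()
    return ll + lam * ul
```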