Not All Tasks Are Born Equal: Understanding Zero-Shot Generalization
Authors: Jing Zhou, Zongyu Lin, Yanan Zheng, Jian Li, Zhilin Yang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For the first time, we show that training on a small number of key tasks beats using all the training tasks, while removing these key tasks substantially hurts performance. We also find that these key tasks are mostly question answering (QA) tasks. These novel findings combined deepen our understanding about zero-shot generalization: training on certain tasks such as QA encodes general knowledge transferable to a wide range of tasks. In addition, to automate this procedure, we devise a method that (1) identifies key training tasks without observing the test tasks by examining the pairwise generalization results and (2) resamples training tasks for better data distribution. Empirically, our approach achieves improved results across various model scales and tasks. (A sketch of this two-step procedure follows the table.) |
| Researcher Affiliation | Academia | 1 Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; 2 Department of Computer Science and Technology, Tsinghua University; 3 Shanghai Artificial Intelligence Laboratory; 4 Shanghai Qi Zhi Institute. zhouj18@mails.tsinghua.edu.cn, {lijian83,zhiliny}@mail.tsinghua.edu.cn |
| Pseudocode | No | The paper describes the steps of the proposed method in paragraph form (e.g., 'Formally, we are given a set of training tasks...', 'Generally, our method consists of three major steps.'), but does not present them as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is released at https://github.com/zhouj8553/Improving-T0. |
| Open Datasets | Yes | We followed the setting of T0 (Sanh et al., 2022) and adopted the tasks therein. There are 38 training tasks across 8 task types, and 11 test tasks ranging from natural language inference (RTE (Candela et al., 2006), CB (De Marneffe et al., 2019), ANLI/R1-R3 (Nie et al., 2020)), coreference resolution (WSC (Levesque et al., 2012), Winogrande (Sakaguchi et al., 2020)), sentence completion (COPA (Roemmele et al., 2011), Story Cloze (Mostafazadeh et al., 2017), Hellaswag (Zellers et al., 2019)), to word disambiguation (WiC (Pilehvar & Camacho-Collados, 2019)). |
| Dataset Splits | No | The paper mentions 'training tasks' and 'test tasks' but does not specify details about a validation set or its split from the data. |
| Hardware Specification | No | The paper mentions training on different model scales (T5-Large, T5-XL, T5-XXL, GPT-Neo1.3B) and refers to 'limitation of computing resources' but does not provide specific hardware details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using models like T5 and GPT-Neo, and specifies an optimizer (ADAM), but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For all the experiments, we adopt the ADAM optimizer and use a learning rate of 1e-4. Training hyper-parameters: Single Task Transfer uses batch size 512 for 1,000 steps; Top-3 Task Transfer uses batch size 1,024 for 2,000 steps; Top-8 Task Transfer uses batch size 1,024 for 10,000 steps; Full Dataset uses batch size 1,024 for 20,000 steps. Based on our resampling strategy, we set the pre-detection parameters as TH1 = 5 and TH2 = 10, and then choose the datasets which are counted as key tasks at least twice (i.e., all tasks A with g(A) ≥ 2). Given each key task D with data size \|D\|, we duplicate D 5 times (Nu = 5) for the upsampling strategy and empirically start from 50,000 samples for each dataset. For the downsampling strategy, we downsample each non-key task to Nd = min(50,000, \|D\|) samples. (Hedged sketches of the key-task detection and resampling steps follow this table.) |
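
The Research Type row summarizes the paper's two-step method: detect key training tasks from pairwise generalization results, then resample the training mixture. Below is a minimal sketch of the detection step, assuming `perf[a][b]` holds the zero-shot score on held-out training task `b` of a model fine-tuned only on task `a`. The simple top-k counting rule and the function name `detect_key_tasks` are illustrative stand-ins for the paper's TH1/TH2 pre-detection criterion, not the authors' exact implementation; with `min_count=2`, the final filter corresponds to keeping tasks with g(A) ≥ 2.

```python
def detect_key_tasks(perf, top_k=5, min_count=2):
    """Count, for each source task a, on how many target tasks b it ranks
    among the top_k transfer sources (g(a)); keep tasks with g(a) >= min_count.

    perf: nested dict, perf[a][b] = score on task b after training only on a.
    Illustrative sketch only; the paper's TH1/TH2 rule may differ.
    """
    tasks = list(perf.keys())
    counts = {a: 0 for a in tasks}
    for b in tasks:
        # Rank all other tasks as transfer sources for target task b.
        sources = sorted((a for a in tasks if a != b),
                         key=lambda a: perf[a][b], reverse=True)
        for a in sources[:top_k]:
            counts[a] += 1
    return {a for a, count in counts.items() if count >= min_count}
```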
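The Experiment Setup row gives the resampling constants (Nu = 5 duplication for key tasks, a 50,000-sample per-dataset budget). The sketch below shows one way those constants could be applied to build the training mixture; the exact sampling order and the interaction with prompt templates follow the released code, and the helper `resample_training_mixture` is a hypothetical name introduced here.

```python
import random

NU = 5        # duplication factor for key tasks (from the paper)
CAP = 50_000  # per-dataset sample budget (from the paper)

def resample_training_mixture(datasets, key_tasks, seed=0):
    """Build a flat training mixture: upsample key tasks by duplication,
    downsample non-key tasks to at most min(CAP, |D|) examples each.

    datasets: dict mapping task name -> list of examples.
    key_tasks: set of task names selected as key tasks.
    """
    rng = random.Random(seed)
    mixture = []
    for name, examples in datasets.items():
        # Keep at most CAP examples per dataset before any duplication.
        base = examples if len(examples) <= CAP else rng.sample(examples, CAP)
        if name in key_tasks:
            # Upsampling: duplicate the capped key-task data NU times.
            mixture.extend(base * NU)
        else:
            # Downsampling: non-key tasks contribute min(CAP, |D|) examples.
            mixture.extend(base)
    rng.shuffle(mixture)
    return mixture
```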