Exploring the Benefits of Training Expert Language Models over Instruction Tuning

Authors: Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Minjoon Seo

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we report an unexpected finding that an expert LM fine-tuned on just a single task can outperform an MT LM trained with 300+ different tasks on 11 different unseen datasets and on 13 datasets of the BIG-bench benchmark by a mean accuracy of 3.20% and 1.29%, respectively. Main results: Table 1 shows the evaluation results on the 11 unseen datasets, Table 2 shows the results on the 13 unseen BIG-Bench tasks, and Table 3 shows the results on the 8 unseen generative tasks.
Researcher Affiliation | Collaboration | 1KAIST, 2LG AI Research, 3University of Illinois Chicago. Correspondence to: Joel Jang <joeljang@kaist.ac.kr>.
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | The code is available at https://github.com/joeljang/ELM.
Open Datasets | Yes | Following the setting of Sanh et al. (2021), we use a total of 36 training datasets of T0 for training our experts.
Dataset Splits | Yes | We evaluate the baseline MT LMs (T0-3B, T0-11B) and our proposed method (T5-3B + DE/PE) on the same evaluation setting as the original T0 paper (Sanh et al., 2021): 11 unseen datasets that can be categorized into 4 task categories, and 13 datasets from the BIG-Bench benchmark (Srivastava et al., 2022), which are diverse and challenging tasks not encountered during training. Table 5 reports evaluation performance on 300 sample instances from each validation dataset of the 36 training tasks, categorized into 8 task categories.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used to run the experiments.
Software Dependencies | No | The paper names the models and libraries used (e.g., the T5 model, the Promptsource library, Sentence Transformers) with citations, but does not give version numbers for software dependencies such as Python, PyTorch, or TensorFlow, which are needed for full reproducibility.
Experiment Setup | Yes | For each individual fine-tuning, we randomly sample K = 50,000 training instances for each classification task and K = 10,000 for each generative task. We use the LM-adapted T5 model (Lester et al., 2021) checkpoint as our base model and train for 5 epochs with a constant learning rate of 1e-4 for both adapter fine-tuning and full LM fine-tuning. For the construction of the Expert Library, a much smaller S = 100 training instances are randomly sampled for each expert, following Ye et al. (2022a). During inference, we set Q = 32 for applying our Retrieval-of-Experts (RoE) mechanism.
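
To make the Open Datasets and Dataset Splits rows above concrete, here is a minimal sketch of how one of the 36 T0 training tasks could be sampled and prompted for training a single-task expert. The dataset choice (imdb), random seed, and template selection are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch (assumed pipeline): sample K prompted instances from one of the
# 36 T0 training tasks to build a single-task expert's training set.
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

K = 50_000  # instances per classification task, per the Experiment Setup row

# "imdb" is one of the T0 training datasets; it is used here only as an example.
raw = load_dataset("imdb", split="train")
raw = raw.shuffle(seed=42).select(range(min(K, len(raw))))

templates = DatasetTemplates("imdb")
template = templates[templates.all_template_names[0]]  # pick one prompt template

def to_prompted(example):
    # Template.apply returns the prompted input string and (if defined) the target.
    prompted = template.apply(example)
    return {"source": prompted[0], "target": prompted[1] if len(prompted) > 1 else ""}

expert_train = raw.map(to_prompted, remove_columns=raw.column_names)
print(expert_train[0]["source"][:200], "->", expert_train[0]["target"])
```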
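
The Experiment Setup row gives the fine-tuning hyperparameters (LM-adapted T5 base, 5 epochs, constant learning rate of 1e-4). A hedged sketch with Hugging Face transformers follows; the checkpoint id, batch size, and sequence lengths are assumptions rather than values reported in the paper, and `expert_train` is the prompted dataset from the sampling sketch above.

```python
# Hedged fine-tuning sketch: LM-adapted T5, 5 epochs, constant LR of 1e-4, as in
# the Experiment Setup row. Checkpoint id, batch size, and max lengths are assumed.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "google/t5-xl-lm-adapt"  # assumed HF id for the LM-adapted T5 (3B)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def tokenize(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = expert_train.map(tokenize, batched=True, remove_columns=expert_train.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="elm-expert",
    num_train_epochs=5,             # paper setting
    learning_rate=1e-4,             # paper setting
    lr_scheduler_type="constant",   # constant learning rate, per the paper
    per_device_train_batch_size=8,  # assumed; not taken from the paper
    save_strategy="no",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```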
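
The same row mentions the Retrieval-of-Experts (RoE) mechanism, which selects an expert for an unseen task by comparing Q = 32 of its instances against the S = 100 instances stored per expert in the Expert Library. The sketch below uses Sentence Transformers (which the paper cites) for the embeddings; the specific encoder checkpoint and the mean-cosine-similarity scoring rule are assumptions, not the paper's exact implementation.

```python
# Hedged Retrieval-of-Experts (RoE) sketch: score each expert by the similarity
# between Q unseen-task instances and the S instances stored in its Expert Library
# entry. The encoder checkpoint and mean-cosine-similarity rule are assumptions.
from sentence_transformers import SentenceTransformer, util

S = 100  # stored instances per expert (paper setting)
Q = 32   # query instances from the unseen task (paper setting)

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def build_expert_library(expert_to_instances):
    """expert_to_instances: dict of expert name -> list of S prompted input strings."""
    return {
        name: encoder.encode(texts[:S], convert_to_tensor=True)
        for name, texts in expert_to_instances.items()
    }

def retrieve_expert(library, unseen_instances):
    """Return the expert whose stored instances are most similar to the queries."""
    query_emb = encoder.encode(unseen_instances[:Q], convert_to_tensor=True)
    scores = {
        name: util.cos_sim(query_emb, stored_emb).mean().item()
        for name, stored_emb in library.items()
    }
    return max(scores, key=scores.get)

# Usage: library = build_expert_library({"imdb": imdb_prompts, "sciq": sciq_prompts})
#        best_expert = retrieve_expert(library, unseen_task_prompts)
```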