MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction Following

Authors: Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, Janice Ahn, Hanzi Xu, Yu Su, Wenpeng Yin

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes." |
| Researcher Affiliation | Academia | The Pennsylvania State University; The Ohio State University; Fudan University; Westlake University; Temple University |
| Pseudocode | No | The paper describes its data construction pipeline in Figure 2 and details its strategies in the text, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | "All the code and data are available at our project page: https://renzelou.github.io/Muffin/" |
| Open Datasets | Yes | "SUPERNI (Wang et al., 2022) is a human-annotated dataset encompassing 1,600+ NLP tasks across diverse categories, sourced from existing benchmarks or created by human experts, implying a remarkable input diversity of SUPERNI. Therefore, we randomly select inputs from the training tasks of SUPERNI as our input text source (only inputs, no outputs are sampled at this stage)." |
| Dataset Splits | Yes | "All the above hyper-parameters are tuned on the validation set of SUPERNI." |
| Hardware Specification | Yes | "All the experiments are done on NVIDIA A100 GPUs with 80 GB of memory." |
| Software Dependencies | Yes | "All of our implementations are based on Hugging Face transformers (Wolf et al., 2019)." |
| Experiment Setup | Yes | "We fine-tune T5 on MUFFIN for 2 epochs. When fine-tuning T5-3B, we set the learning rate to 5e-5 with batch size 6. As for T5-11B, we set the learning rate to 1e-5 with batch size 1 and 12 gradient accumulation steps. All the above hyper-parameters are tuned on the validation set of SUPERNI. We fine-tune Llama2 on all the datasets for 3 epochs with batch size 18, learning rate = 1e-4, lora_r = 8, and lora_alpha = 16. Since the generation API provided by Hugging Face does not support efficient batched evaluation, we fix the evaluation batch size to 1 for all the datasets. We truncate the inputs to 1024 tokens and limit the output length to 128 tokens, with beam search size = 1 (greedy decoding)." (Hedged configuration sketches based on these settings follow the table.) |
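
For concreteness, the reported T5 hyper-parameters map directly onto Hugging Face `TrainingArguments`. The sketch below is a minimal reconstruction of only the quoted values; the output directory names are placeholders, and every setting not quoted above is left at its library default.

```python
# Minimal sketch of the reported T5 fine-tuning hyper-parameters.
# Only values quoted in the table are set; output_dir names are
# placeholders and all other settings keep their library defaults.
from transformers import TrainingArguments

# T5-3B: 2 epochs, learning rate 5e-5, batch size 6 (reported).
t5_3b_args = TrainingArguments(
    output_dir="muffin-t5-3b",          # placeholder
    num_train_epochs=2,
    learning_rate=5e-5,
    per_device_train_batch_size=6,
)

# T5-11B: 2 epochs, learning rate 1e-5, batch size 1 with
# 12 gradient accumulation steps (reported), i.e. an effective
# batch size of 12.
t5_11b_args = TrainingArguments(
    output_dir="muffin-t5-11b",         # placeholder
    num_train_epochs=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=12,
)
```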
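
The Llama2 recipe (3 epochs, batch size 18, learning rate 1e-4, LoRA with r = 8 and alpha = 16) likewise suggests a standard transformers + peft setup. The following is a minimal sketch under stated assumptions: the checkpoint name, the LoRA target modules, and the training dataset are hypothetical and not specified in the quoted text.

```python
# Hedged sketch of the reported Llama2 LoRA fine-tuning configuration.
# Epochs, batch size, learning rate, lora_r, and lora_alpha come from the
# table above; the checkpoint, target modules, and dataset are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"         # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                        # lora_r = 8 (reported)
    lora_alpha=16,                              # lora_alpha = 16 (reported)
    target_modules=["q_proj", "v_proj"],        # assumed, not stated
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="muffin-llama2-lora",            # placeholder
    num_train_epochs=3,                         # reported
    per_device_train_batch_size=18,             # reported batch size
    learning_rate=1e-4,                         # reported
)

# `train_dataset` stands in for a tokenized instruction-tuning split.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```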
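
Finally, the evaluation protocol (inputs truncated to 1024 tokens, at most 128 generated tokens, beam size 1, evaluation batch size 1) amounts to greedy decoding one example at a time. A minimal sketch, assuming a fine-tuned seq2seq checkpoint and an `examples` iterable, both hypothetical:

```python
# Per-example greedy decoding with the reported truncation limits.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "muffin-t5-3b"                     # placeholder fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def generate_one(text: str) -> str:
    # Truncate the input to 1024 tokens, as reported.
    inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    # Greedy decoding (beam size 1), capped at 128 output tokens.
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=1)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Evaluation batch size is fixed to 1: decode each test example on its own.
predictions = [generate_one(ex) for ex in examples]  # `examples` is assumed
```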