MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction Following
Authors: Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, Janice Ahn, Hanzi Xu, Yu Su, Wenpeng Yin
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes. |
| Researcher Affiliation | Academia | The Pennsylvania State University; The Ohio State University; Fudan University; Westlake University; Temple University |
| Pseudocode | No | The paper describes its data construction pipeline in Figure 2 and details its strategies in text, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the code and data are available at our project page: https://renzelou.github.io/Muffin/ |
| Open Datasets | Yes | SUPERNI (Wang et al., 2022) is a human-annotated dataset encompassing 1,600+ NLP tasks across diverse categories, sourced from existing benchmarks or created by human experts, implying a remarkable input diversity of SUPERNI. Therefore, we randomly select inputs from the training tasks of SUPERNI as our input text source (only inputs, no outputs are sampled at this stage). |
| Dataset Splits | Yes | All the above hyper-parameters are tuned on the validation set of SUPERNI. |
| Hardware Specification | Yes | All the experiments are done on NVIDIA A100 GPUs with 80 GB of memory. |
| Software Dependencies | Yes | All of our implementations are based on Hugging Face transformers (Wolf et al., 2019). |
| Experiment Setup | Yes | We fine-tune T5 on MUFFIN for 2 epochs. When fine-tuning T5-3B, we set the learning rate to 5e-5 with batch size 6. For T5-11B, we set the learning rate to 1e-5 with batch size 1 and 12 gradient accumulation steps. All the above hyper-parameters are tuned on the validation set of SUPERNI. We fine-tune Llama2 on all the datasets for 3 epochs with batch size 18, learning rate = 1e-4, lora_r = 8, and lora_alpha = 16. Since the generation API provided by Hugging Face does not support efficient batched evaluation, we fix the evaluation batch size to 1 for all the datasets. We truncate inputs to 1024 tokens and limit the output length to 128 tokens, with beam search size = 1 (greedy decoding). |
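
The Open Datasets row above notes that MUFFIN draws only input texts (no outputs) from the SUPERNI training tasks. Below is a minimal sketch of that sampling step, assuming a local clone of the public natural-instructions repository; the directory layout and field names follow the released SUPERNI format, and the sample size and random seed are illustrative, not the paper's exact values.

```python
# Hypothetical sketch: sample instance inputs (no outputs) from a local clone
# of the SUPERNI task files (github.com/allenai/natural-instructions).
import json
import random
from pathlib import Path

random.seed(42)  # illustrative seed, not from the paper

SUPERNI_TASK_DIR = Path("natural-instructions/tasks")                    # assumed local path
TRAIN_SPLIT_FILE = Path("natural-instructions/splits/default/train_tasks.txt")

# Restrict the pool to SUPERNI *training* tasks so no test tasks leak in.
train_tasks = TRAIN_SPLIT_FILE.read_text().split()

inputs = []
for task_name in train_tasks:
    task_file = SUPERNI_TASK_DIR / f"{task_name}.json"
    if not task_file.exists():
        continue
    task = json.loads(task_file.read_text())
    # Keep only the raw input text of each instance; outputs are discarded,
    # matching "only inputs, no outputs are sampled at this stage".
    inputs.extend(inst["input"] for inst in task.get("Instances", []))

# Randomly sample a pool of input texts to serve as the input text source.
sampled_inputs = random.sample(inputs, k=min(10_000, len(inputs)))
print(f"Sampled {len(sampled_inputs)} inputs from {len(train_tasks)} training tasks")
```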
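
The Experiment Setup row above lists the reported fine-tuning hyperparameters. The sketch below expresses the quoted Llama2 LoRA settings (3 epochs, batch size 18, learning rate 1e-4, lora_r 8, lora_alpha 16) with Hugging Face transformers and peft; the checkpoint name, target_modules choice, and dataset preparation are assumptions not stated in the table.

```python
# Minimal sketch of the reported Llama2 LoRA fine-tuning settings using
# Hugging Face transformers + peft. Only the numeric hyperparameters come
# from the quoted setup; everything else is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint, not specified in the table
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                  # lora_r reported in the paper
    lora_alpha=16,                        # lora_alpha reported in the paper
    target_modules=["q_proj", "v_proj"],  # common default; not stated in the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="muffin-llama2-lora",      # illustrative path
    num_train_epochs=3,                   # reported: 3 epochs
    per_device_train_batch_size=18,       # reported: batch size 18
    learning_rate=1e-4,                   # reported: 1e-4
)

# train_dataset is assumed to be a tokenized MUFFIN dataset prepared separately.
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()

# Evaluation (as quoted above): batch size 1, inputs truncated to 1024 tokens,
# outputs capped at 128 tokens, greedy decoding (beam size 1), e.g.:
# outputs = model.generate(**batch, max_new_tokens=128, num_beams=1)
```

The same skeleton applies to the T5 runs by swapping in a seq2seq model and the quoted T5 hyperparameters (2 epochs; 5e-5 with batch size 6 for T5-3B; 1e-5 with batch size 1 and 12 gradient accumulation steps for T5-11B), without the LoRA adapter.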