Evaluating the Zero-shot Robustness of Instruction-tuned Language Models

Authors: Jiuding Sun, Chantal Shaib, Byron C. Wallace

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To answer the former, we collect a set of 319 English instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. (See the evaluation sketch after the table.)
Researcher Affiliation | Collaboration | Anonymous authors. Paper under double-blind review.
Pseudocode | No | The paper does not contain any explicit pseudocode.
Open Source Code | Yes | All the results reported in the paper are reproducible. We submit the code and include all the implementation details in Appendix B.
Open Datasets | Yes | We include all 57 tasks from MMLU and 14 of 24 tasks from BBL. We use the same instructions for all tasks in the same category, taken from the instruction tuning datasets associated with each model. These instructions are general, e.g., in the case of classification they request that the model consider an example with respect to categorization criteria and label space provided by the instance, and select an appropriate category (examples in Table 1). (See the dataset-loading sketch after the table.)
Dataset Splits | No | The paper mentions using existing benchmarks (MMLU and BBL) for evaluation but does not provide explicit train/validation/test split details.
Hardware Specification | Yes | We conduct all training and ablation studies on 8 A100s with 80GB memory.
Software Dependencies | No | The paper mentions using specific models like GPT-4, Llama-2-13B, and text-davinci-003, and an optimizer (AdamW), but does not provide specific version numbers for software libraries or environments used for experimentation (e.g., PyTorch version, Hugging Face Transformers version).
Experiment Setup | Yes | We kept the KL-Loss weight fixed at 0.8. We trained both Flan-T5-XL and Alpaca-7B with a batch size of 4 and a gradient accumulation of 4. We set the weight decay to 1e-5 and the learning rate to 5e-4 for all experiments. We use a prefix length of 10. (See the training-setup sketch after the table.)
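
The Research Type row quotes the paper's core protocol: score each task under many semantically equivalent instruction phrasings and compare average performance and variance. Below is a minimal sketch of that kind of robustness evaluation, assuming a hypothetical predict(instruction, example_input) callable and labeled examples; neither is from the paper's released code.

```python
from statistics import mean, stdev

def accuracy(instruction, examples, predict):
    """Fraction of examples answered correctly under one instruction phrasing."""
    correct = sum(predict(instruction, ex["input"]) == ex["label"] for ex in examples)
    return correct / len(examples)

def robustness_report(instructions, examples, predict):
    """Mean and spread of task accuracy across semantically equivalent instructions."""
    scores = [accuracy(instr, examples, predict) for instr in instructions]
    return {
        "mean_accuracy": mean(scores),
        "std_accuracy": stdev(scores) if len(scores) > 1 else 0.0,
        "min_accuracy": min(scores),
        "max_accuracy": max(scores),
    }
```

A large std or min-max gap in this report corresponds to the paper's finding that rephrasing an instruction can substantially move downstream performance.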
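
The Open Datasets row notes that all 57 MMLU tasks are used with general, category-level instructions. The following sketch shows how such prompts could be assembled, assuming the Hugging Face datasets library and the cais/mmlu dataset on the Hub; the instruction wording and the subject chosen here are illustrative, not the paper's.

```python
from datasets import load_dataset

# Illustrative general classification-style instruction; the paper's actual
# phrasings come from the models' instruction-tuning datasets.
INSTRUCTION = (
    "Read the question and the answer choices, then select the letter of the "
    "single best answer."
)

def format_prompt(example):
    """Render one MMLU example as an instruction-prefixed multiple-choice prompt."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", example["choices"]))
    return f"{INSTRUCTION}\n\nQuestion: {example['question']}\n{choices}\nAnswer:"

# "abstract_algebra" is just one of the 57 MMLU subjects, used here as an example.
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")
prompts = [format_prompt(ex) for ex in mmlu]
```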
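
The Experiment Setup row lists the reported training hyperparameters. Below is a minimal sketch of wiring them into an AdamW loop with gradient accumulation; the task_loss_fn and kl_loss_fn callables and the exact form of the KL-regularized objective are assumptions rather than the authors' released implementation.

```python
from torch.optim import AdamW

# Hyperparameters reported in the Experiment Setup row.
LEARNING_RATE = 5e-4
WEIGHT_DECAY = 1e-5
BATCH_SIZE = 4          # used when constructing the DataLoader (not shown here)
GRAD_ACCUM_STEPS = 4
KL_WEIGHT = 0.8
PREFIX_LENGTH = 10      # length of the soft prompt / prefix, configured elsewhere

def train_epoch(model, dataloader, task_loss_fn, kl_loss_fn):
    """One epoch with gradient accumulation and a KL-regularized objective."""
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        outputs = model(**batch)  # assumes batch is a dict of model inputs
        # Combine the task loss with a KL term weighted at 0.8; which
        # distributions the KL compares is an assumption in this sketch.
        loss = task_loss_fn(outputs, batch) + KL_WEIGHT * kl_loss_fn(outputs, batch)
        (loss / GRAD_ACCUM_STEPS).backward()
        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```

Dividing the loss by GRAD_ACCUM_STEPS keeps the effective update equivalent to a batch of BATCH_SIZE x GRAD_ACCUM_STEPS examples, matching the reported accumulation of 4.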