Evaluating the Zero-shot Robustness of Instruction-tuned Language Models
Authors: Jiuding Sun, Chantal Shaib, Byron C Wallace
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer the former, we collect a set of 319 English instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. |
| Researcher Affiliation | Collaboration | Anonymous authors. Paper under double-blind review. |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks describing its methods. |
| Open Source Code | Yes | All the results reported in the paper are reproducible. We submit the code and include all the implementation details in Appendix B. |
| Open Datasets | Yes | We include all 57 tasks from MMLU and 14 of 24 tasks from BBL. We use the same instructions for all tasks in the same category, taken from the instruction tuning datasets associated with each model. These instructions are general, e.g., in the case of classification they request that the model consider an example with respect to categorization criteria and label space provided by the instance, and select an appropriate category (examples in Table 1). |
| Dataset Splits | No | The paper mentions using existing benchmark tasks (MMLU and BBL) but does not provide explicit train/validation/test split details for its experiments. |
| Hardware Specification | Yes | We conduct all training and ablation studies on 8 A100s with 80GB memory. |
| Software Dependencies | No | The paper mentions using specific models like GPT-4, Llama2-13B, and text-davinci-003, and an optimizer (AdamW), but does not provide specific version numbers for software libraries or environments used for experimentation (e.g., PyTorch version, Hugging Face Transformers version). |
| Experiment Setup | Yes | We kept the KL-Loss weight fixed at 0.8. We trained both Flan-T5-XL and Alpaca-7B with a batch size of 4 and a batch gradient accumulation of 4. We set the weight decay to 1e-5 and the learning rate to 5e-4 for all experiments. We use a prefix length of 10. |
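
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration block. The following is a minimal sketch assuming PyTorch; the `config` dict and the stand-in model are illustrative placeholders, not the authors' released code.

```python
import torch
from torch import nn

# Training configuration as reported in the paper's Experiment Setup details.
config = {
    "kl_loss_weight": 0.8,        # fixed KL-loss weight
    "batch_size": 4,
    "grad_accumulation_steps": 4,
    "weight_decay": 1e-5,
    "learning_rate": 5e-4,
    "prefix_length": 10,          # number of prefix (soft prompt) tokens
}

# Placeholder module standing in for the actual instruction-tuned model
# (the paper fine-tunes Flan-T5-XL and Alpaca-7B).
model = nn.Linear(16, 16)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)
```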
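
The Research Type row describes measuring average performance and variance across semantically equivalent instruction phrasings. As a rough illustration of that aggregation step (not the authors' evaluation code), the sketch below computes the mean and spread of hypothetical per-instruction accuracies for a single task.

```python
from statistics import mean, pstdev

# Illustrative accuracies for one task, one entry per instruction phrasing.
# The phrasings and numbers are made up; they are not results from the paper.
accuracy_by_instruction = {
    "Classify the following example into one of the given categories.": 0.71,
    "Given the label set below, choose the best category for the input.": 0.64,
    "Which category does this text belong to?": 0.58,
}

scores = list(accuracy_by_instruction.values())
print(f"mean accuracy across phrasings: {mean(scores):.3f}")
print(f"std across phrasings:           {pstdev(scores):.3f}")
```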