InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models
Authors: Lichang Chen, Jiuhai Chen, Tom Goldstein, Heng Huang, Tianyi Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate INSTRUCTZERO on different combinations of open-source LLMs and APIs including Vicuna and ChatGPT. INSTRUCTZERO outperforms SOTA auto-instruction methods across a variety of downstream tasks. Our code is available: https://github.com/Lichang-Chen/InstructZero. Extensive experiments demonstrate that our method could effectively generate instructions that enhance task performance while achieving predictions on par with or even superior to those created by previous methods. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Maryland, College Park. Correspondence to: Lichang Chen <bobchen@cs.umd.edu>, Jiuhai Chen <jchen169@umd.edu>. |
| Pseudocode | Yes | The complete procedure is provided in Algorithm 1. |
| Open Source Code | Yes | Our code is available: https://github.com/Lichang-Chen/InstructZero. |
| Open Datasets | Yes | We assess the effectiveness of zero-shot in-context learning on instruction tasks proposed in (Honovich et al., 2022), including all 24 tasks used in previous auto-instruction work (Zhou et al., 2022). We further add 8 extra tasks to enrich the benchmark for evaluating all methods in more comprehensive scenarios spanning many facets of language understanding. We provide detailed descriptions of each task in the Appendix. Training-set examples can be used for instruction optimization but the final instruction p is evaluated on a held-out test set. Zero-shot performance H(p) on the test set is reported. |
| Dataset Splits | Yes | For each task, we draw τ = 5 and 20 samples from the training set as the exemplars and validation set Dt, respectively. |
| Hardware Specification | Yes | All training and tests are conducted on a NVIDIA RTX A6000 GPU. |
| Software Dependencies | No | The paper mentions several LLMs and APIs used (e.g., Vicuna, ChatGPT, GPT-3.5-turbo, LLaMA, Stanford Alpaca, GPT-4, Claude, PaLM-2). However, it does not specify explicit version numbers for these or any other software components, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | For each task, we draw τ = 5 and 20 samples from the training set as the exemplars and validation set Dt, respectively. For the number of tokens in soft prompts, we search for the best value among {3, 5, 10} based on the validation set performance. We draw entries of the random projection matrix A from a uniform distribution over [−1, 1]. The dimensionality d of p is set to 10. In experiments, we apply a mini-batch version of INSTRUCTZERO that explores 25 soft prompts in every iteration. |
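The random-projection step quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code: the variable names, the embedding width (`embed_dim = 4096`), and the use of NumPy are assumptions; only `d = 10`, the soft-prompt lengths searched over {3, 5, 10}, and the uniform [−1, 1] entries of A come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10            # intrinsic dimension of the optimized point p (from the paper)
n_tokens = 5      # soft-prompt length, searched over {3, 5, 10} (from the paper)
embed_dim = 4096  # hidden size of the open-source LLM (illustrative assumption)

# Random projection matrix A with entries drawn uniformly from [-1, 1],
# mapping the low-dimensional p into the LLM's soft-prompt space.
A = rng.uniform(-1.0, 1.0, size=(n_tokens * embed_dim, d))

# A candidate point proposed by the Bayesian-optimization loop.
p = rng.standard_normal(d)

# The projected soft prompt, reshaped into n_tokens embedding vectors
# that would be prepended to the LLM's input embeddings.
soft_prompt = (A @ p).reshape(n_tokens, embed_dim)

print(soft_prompt.shape)  # (5, 4096)
```

Optimizing in the 10-dimensional space of p rather than over the full `n_tokens × embed_dim` soft prompt is what keeps the black-box Bayesian optimization tractable.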