Are Human-generated Demonstrations Necessary for In-context Learning?

Authors: Rui Li, Guoyin Wang, Jiwei Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy and achieves results comparable to ICL with hand-crafted demonstrations.
Researcher Affiliation | Collaboration | Rui Li (University of Science and Technology of China), Guoyin Wang (Bytedance), Jiwei Li (Zhejiang University)
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described through text and examples of LLM prompts and outputs.
Open Source Code | Yes | Code is available at https://github.com/ruili33/SEC.
Open Datasets | Yes | We evaluate SEC on the following tasks and datasets (details in Appendix A.1): Arithmetic Reasoning: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021); Commonsense Reasoning: AI2 Reasoning Challenge (ARC) (Clark et al., 2018); Multi-task Language Understanding: MMLU (Hendrycks et al.), C-Eval (Huang et al., 2023); Code Generation: HumanEval (Chen et al., 2021).
Dataset Splits | Yes | For the ARC dataset, we used the rationales generated by the GPT-4 model via the ChatGPT official website for the first five examples in the validation set.
Hardware Specification | No | For all our baselines, we adopt ChatGPT (gpt-3.5-turbo), GPT-4 (OpenAI, 2023), and Llama 2 34B (Touvron et al., 2023) as the model backbone (details in Appendix A.2). If not specified otherwise, we use GPT-3.5 for our experiments.
Software Dependencies | No | For all our baselines, we adopt ChatGPT (gpt-3.5-turbo), GPT-4 (OpenAI, 2023), and Llama 2 34B (Touvron et al., 2023) as the model backbone (details in Appendix A.2).
Experiment Setup | Yes | The number of shots for each task is shown in Table 1. ... Then we have the LLM generate the demonstration again until it passes the validation. ... Then, we slightly alter the prompt while setting the temperature to 1 to add randomness.
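The generate-validate-retry procedure quoted in the Experiment Setup row can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the looks_valid check, and all helper names are hypothetical, and only the overall flow (have the GPT-3.5 backbone write its own demonstrations, regenerate with a slightly altered prompt at temperature 1 until validation passes, then answer the test question with those self-generated demonstrations in context) follows the paper's description of SEC.

```python
# Hypothetical sketch of the SEC loop: the model generates its own few-shot
# demonstrations, which are validated and regenerated before being used as
# in-context examples. Prompts and the validation rule are assumptions.
from openai import OpenAI

client = OpenAI()            # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"      # GPT-3.5 backbone used for most experiments

def chat(prompt: str, temperature: float = 0.0) -> str:
    """Single-turn call to the chat backbone."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def looks_valid(demos: str, n_shots: int) -> bool:
    """Toy validation check (assumption): one 'Q:'/'A:' pair per requested shot."""
    return demos.count("Q:") >= n_shots and demos.count("A:") >= n_shots

def generate_demonstrations(task_description: str, n_shots: int, max_tries: int = 5) -> str:
    """Ask the LLM to write its own demonstrations, retrying until they pass validation."""
    base = (f"Write {n_shots} example question-answer pairs for the task below, "
            f"formatted as 'Q: ...' and 'A: ...'.\nTask: {task_description}")
    prompt = base
    for attempt in range(max_tries):
        # Retries slightly alter the prompt and use temperature 1 to add
        # randomness, mirroring the quoted experiment setup.
        demos = chat(prompt, temperature=0.0 if attempt == 0 else 1.0)
        if looks_valid(demos, n_shots):
            return demos
        prompt = base + "\nMake sure every example follows the required format exactly."
    raise RuntimeError("Could not obtain well-formed demonstrations.")

def answer_with_sec(task_description: str, question: str, n_shots: int) -> str:
    """Answer a test question with the model's self-generated demonstrations in context."""
    demos = generate_demonstrations(task_description, n_shots)
    return chat(f"{demos}\nQ: {question}\nA:")
```

For example, answer_with_sec("Solve grade-school math word problems.", question, n_shots=4) would correspond to a 4-shot SEC run on a GSM8K-style question; the number of shots per task is the value reported in Table 1 of the paper.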