Are Human-generated Demonstrations Necessary for In-context Learning?

Authors: Rui Li, Guoyin Wang, Jiwei Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy and achieves results comparable to ICL with hand-crafted demonstrations.
Researcher Affiliation | Collaboration | Rui Li (University of Science and Technology of China), Guoyin Wang (Bytedance), Jiwei Li (Zhejiang University)
Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described through text and examples of LLM prompts and outputs.
Open Source Code | Yes | Code is available at https://github.com/ruili33/SEC.
Open Datasets | Yes | We evaluate SEC on the following tasks and datasets (details in Appendix A.1): Arithmetic Reasoning: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021); Commonsense Reasoning: AI2 Reasoning Challenge (ARC) (Clark et al., 2018); Multi-task Language Understanding: MMLU (Hendrycks et al.), C-Eval (Huang et al., 2023); Code Generation: HumanEval (Chen et al., 2021).
Dataset Splits | Yes | For the ARC dataset, we used the rationales generated by the GPT-4 model via the ChatGPT official website for the first five examples in the validation set.
Hardware Specification | No | For all our baselines, we adopt ChatGPT (gpt-3.5-turbo), GPT-4 (OpenAI, 2023), and Llama 2 34B (Touvron et al., 2023) as the model backbone (details in Appendix A.2). If not specified otherwise, we use GPT-3.5 for our experiments.
Software Dependencies | No | For all our baselines, we adopt ChatGPT (gpt-3.5-turbo), GPT-4 (OpenAI, 2023), and Llama 2 34B (Touvron et al., 2023) as the model backbone (details in Appendix A.2).
Experiment Setup | Yes | The number of shots for each task is shown in Table 1. ... Then we have the LLM generate the demonstration again until it passes the validation. ... Then, we slightly alter the prompt while setting the temperature to 1 to add randomness.
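The generate-validate-retry procedure quoted in the Experiment Setup row can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the looks_valid check, and all helper names are hypothetical, and only the overall flow (have the GPT-3.5 backbone write its own demonstrations, regenerate with a slightly altered prompt at temperature 1 until validation passes, then answer the test question with those self-generated demonstrations in context) follows the paper's description of SEC.

```python
# Hypothetical sketch of the SEC loop: the model generates its own few-shot
# demonstrations, which are validated and regenerated before being used as
# in-context examples. Prompts and the validation rule are assumptions.
from openai import OpenAI

client = OpenAI()            # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"      # GPT-3.5 backbone used for most experiments

def chat(prompt: str, temperature: float = 0.0) -> str:
    """Single-turn call to the chat backbone."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def looks_valid(demos: str, n_shots: int) -> bool:
    """Toy validation check (assumption): one 'Q:'/'A:' pair per requested shot."""
    return demos.count("Q:") >= n_shots and demos.count("A:") >= n_shots

def generate_demonstrations(task_description: str, n_shots: int, max_tries: int = 5) -> str:
    """Ask the LLM to write its own demonstrations, retrying until they pass validation."""
    base = (f"Write {n_shots} example question-answer pairs for the task below, "
            f"formatted as 'Q: ...' and 'A: ...'.\nTask: {task_description}")
    prompt = base
    for attempt in range(max_tries):
        # Retries slightly alter the prompt and use temperature 1 to add
        # randomness, mirroring the quoted experiment setup.
        demos = chat(prompt, temperature=0.0 if attempt == 0 else 1.0)
        if looks_valid(demos, n_shots):
            return demos
        prompt = base + "\nMake sure every example follows the required format exactly."
    raise RuntimeError("Could not obtain well-formed demonstrations.")

def answer_with_sec(task_description: str, question: str, n_shots: int) -> str:
    """Answer a test question with the model's self-generated demonstrations in context."""
    demos = generate_demonstrations(task_description, n_shots)
    return chat(f"{demos}\nQ: {question}\nA:")
```

For example, answer_with_sec("Solve grade-school math word problems.", question, n_shots=4) would correspond to a 4-shot SEC run on a GSM8K-style question; the number of shots per task is the value reported in Table 1 of the paper.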