Are Human-generated Demonstrations Necessary for In-context Learning?
Authors: Rui Li, Guoyin Wang, Jiwei Li
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy and achieves comparable results to ICL with hand-crafted demonstrations. |
| Researcher Affiliation | Collaboration | Rui Li (University of Science and Technology of China), Guoyin Wang (Bytedance), Jiwei Li (Zhejiang University) |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described through text and examples of LLM prompts and outputs. |
| Open Source Code | Yes | Code is available at https://github.com/ruili33/SEC. |
| Open Datasets | Yes | We evaluate SEC in the following tasks and datasets (details in Appendix A.1): Arithmetic Reasoning: GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021); Commonsense Reasoning: AI2 Reasoning Challenge (ARC) (Clark et al., 2018); Multi-task Language Understanding: MMLU (Hendrycks et al., 2021), C-Eval (Huang et al., 2023); Code Generation: HumanEval (Chen et al., 2021). |
| Dataset Splits | Yes | For the ARC dataset, we used the rationale generated by the GPT-4 model via the ChatGPT official website for the first five examples in the validation set. |
| Hardware Specification | No | For all our baselines, we adopt ChatGPT (gpt-3.5-turbo), GPT-4 (OpenAI, 2023) and Llama2 34B (Touvron et al., 2023) as the model backbone, details in Appendix A.2. If not specified otherwise, we are using GPT-3.5 for our experiments. |
| Software Dependencies | No | For all our baselines, we adopt ChatGPT (gpt-3.5-turbo), GPT-4 (OpenAI, 2023) and Llama2 34B (Touvron et al., 2023) as the model backbone, details in Appendix A.2. |
| Experiment Setup | Yes | The number of shots for different tasks is shown in Table 1. ... Then we have the LLM generate the demonstration again until it passes the validation. ... Then, we slightly alter the prompt while setting the temperature to 1 to add randomness. (A minimal sketch of this generate-and-validate loop follows the table.) |
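The quoted setup describes SEC's core loop: the LLM drafts its own in-context demonstrations, any draft that fails a task-specific validation is regenerated, and temperature 1 plus slight prompt perturbations add diversity. Below is a minimal Python sketch of that loop, assuming the OpenAI chat API (openai>=1.0) as the backbone; `build_demo_prompt`, `is_valid_demo`, and the `max_attempts` cap are hypothetical stand-ins of ours, not the paper's code (the actual implementation is at https://github.com/ruili33/SEC).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_demo_prompt(task_description: str) -> str:
    # Hypothetical prompt template; the paper's exact wording differs.
    return (
        "Write one worked example (a question, step-by-step reasoning, "
        f"and a final answer) for the following task: {task_description}"
    )


def is_valid_demo(demo: str) -> bool:
    # Hypothetical validation; the paper regenerates demonstrations
    # until they pass a task-specific check.
    return bool(demo.strip()) and "answer" in demo.lower()


def generate_demonstration(task_description: str,
                           model: str = "gpt-3.5-turbo",
                           max_attempts: int = 5) -> str:
    """Have the LLM write its own demonstration, regenerating until it
    passes validation (max_attempts is our addition, to bound the loop)."""
    for _ in range(max_attempts):
        response = client.chat.completions.create(
            model=model,
            temperature=1.0,  # randomness yields diverse demonstrations
            messages=[{"role": "user",
                       "content": build_demo_prompt(task_description)}],
        )
        demo = response.choices[0].message.content or ""
        if is_valid_demo(demo):
            return demo
    raise RuntimeError("No valid demonstration generated")
```

In the full method, the validated self-generated demonstrations are then prepended as in-context examples ahead of the actual test query, taking the place of hand-crafted few-shot demonstrations.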