Many-Shot In-Context Learning
Authors: Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically evaluate ICL performance at different scales of in-context examples for a wide range of tasks with Gemini 1.5 Pro. |
| Researcher Affiliation | Industry | Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle — Google DeepMind |
| Pseudocode | Yes | Listing 1: Code for Generating Synthetic datasets for Linear Classification in High Dimensions. |
| Open Source Code | No | Unfortunately, our experiments depend on internal infrastructure and code that cannot be made fully public. |
| Open Datasets | Yes | We investigate how scaling the number of shots affects ICL performance on a wide variety of tasks ( 2): problem solving using MATH [22] and GSM8K [10], question-answering [GPQA, 51], summarization using XSum [42] and XLSum [19], algorithmic reasoning [BBH, 55], reward modeling [Code verification, A.5], low-resource machine translation [FLORES, 17], planning [Logistics, 53], and sentiment analysis [FP, 39]. |
| Dataset Splits | Yes | For reliable results, we randomly sample in-context examples for each K-shot prompt multiple times using different random seeds and report average performance, along with some visualization for performance on individual seeds. |
| Hardware Specification | No | It is typically not possible to infer details about compute resources as experiments depend on the Gemini 1.5 Pro API. |
| Software Dependencies | Yes | We use KV caching [48]... We use the default implementation of k-nearest neighbours (with k = 5) from scikit-learn [47]. |
| Experiment Setup | Yes | For reliable results, we randomly sample in-context examples for each K-shot prompt multiple times using different random seeds and report average performance, along with some visualization for performance on individual seeds. To ensure that using more shots provides additional information, any K-shot prompt in our setup includes all in-context examples from prompts with less than K examples... The best values for both models (fastest learning) were max_lr=1e-4, warmup_steps=1000. |
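The paper's Listing 1 (synthetic linear classification in high dimensions) is not reproduced in this review. A minimal sketch of what such data generation typically looks like is shown below; the function name, dimensions, and sample counts are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def make_linear_dataset(n_samples=32, dim=16, seed=0):
    """Hypothetical sketch: sample points in R^dim and label them by
    which side of a random ground-truth hyperplane they fall on."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)                # random separating hyperplane
    X = rng.normal(size=(n_samples, dim))   # feature vectors
    y = (X @ w > 0).astype(int)             # binary labels from the sign
    return X, y

X, y = make_linear_dataset(n_samples=64, dim=32, seed=42)
```

Varying `seed` yields independently drawn tasks, mirroring the paper's use of multiple random seeds per configuration.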
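The setup excerpt describes a nesting constraint: every K-shot prompt contains all in-context examples from smaller-K prompts drawn with the same seed, so performance differences across K reflect added information rather than resampling noise. A hedged sketch of that sampling scheme (the helper name and pool structure are assumptions, not the authors' code):

```python
import random

def build_nested_prompts(pool, ks, seed=0):
    """For a given seed, draw one shuffled ordering of the example pool,
    then take prefixes: the K-shot prompt is the first K examples, so
    every smaller-K prompt is a strict prefix of every larger-K prompt."""
    rng = random.Random(seed)
    order = rng.sample(pool, max(ks))
    return {k: order[:k] for k in sorted(ks)}

# One nested family of prompts per seed; average results over seeds.
prompts = build_nested_prompts(list(range(500)), ks=[4, 16, 64], seed=1)
```

Repeating this over several seeds and averaging matches the reported protocol of multiple randomly sampled K-shot prompts per configuration.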