Many-Shot In-Context Learning

Authors: Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We systematically evaluate ICL performance at different scales of in-context examples for a wide range of tasks with Gemini 1.5 Pro. |
| Researcher Affiliation | Industry | Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle (Google DeepMind) |
| Pseudocode | Yes | Listing 1: Code for Generating Synthetic datasets for Linear Classification in High Dimensions. |
| Open Source Code | No | Unfortunately, our experiments depend on internal infrastructure and code that cannot be made fully public. |
| Open Datasets | Yes | We investigate how scaling the number of shots affects ICL performance on a wide variety of tasks ( 2): problem solving using MATH [22] and GSM8K [10], question-answering [GPQA, 51], summarization using XSum [42] and XLSum [19], algorithmic reasoning [BBH, 55], reward modeling [Code verification, A.5], low-resource machine translation [FLORES, 17], planning [Logistics, 53], and sentiment analysis [FP, 39]. |
| Dataset Splits | Yes | For reliable results, we randomly sample in-context examples for each K-shot prompt multiple times using different random seeds and report average performance, along with some visualization for performance on individual seeds. |
| Hardware Specification | No | It is typically not possible to infer details about compute resources, as the experiments depend on the Gemini 1.5 Pro API. |
| Software Dependencies | Yes | We use KV caching [48]... We use the default implementation of k-nearest neighbours (with k = 5) from scikit-learn [47]. |
| Experiment Setup | Yes | For reliable results, we randomly sample in-context examples for each K-shot prompt multiple times using different random seeds and report average performance, along with some visualization for performance on individual seeds. To ensure that using more shots provides additional information, any K-shot prompt in our setup includes all in-context examples from prompts with less than K examples... The best values for both models (fastest learning) were max_lr=1e-4, warmup_steps=1000. |
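The Pseudocode row refers to the paper's Listing 1, which generates synthetic datasets for linear classification in high dimensions. The listing itself is not reproduced in this report, so the following is only a minimal sketch of that kind of generator; the function name and the choice of standard-Gaussian inputs with a random linear separator are assumptions, not the paper's exact code.

```python
import numpy as np

def generate_linear_classification_data(n_points, dim, seed=0):
    """Sketch of a synthetic high-dimensional linear-classification generator.

    Assumption: inputs are i.i.d. standard Gaussian and labels come from the
    sign of a projection onto a random ground-truth hyperplane.
    """
    rng = np.random.default_rng(seed)
    # Random ground-truth separating hyperplane through the origin.
    w = rng.normal(size=dim)
    # Inputs drawn i.i.d. from a standard Gaussian in R^dim.
    x = rng.normal(size=(n_points, dim))
    # Binary labels from which side of the hyperplane each point falls on.
    y = (x @ w > 0).astype(int)
    return x, y

# Example: a small dataset with 32 points in 16 dimensions.
x, y = generate_linear_classification_data(n_points=32, dim=16)
```

Fixing the seed makes each dataset reproducible, which matters for the multi-seed averaging protocol the report quotes.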
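The Experiment Setup row quotes two protocol details: prompts are resampled under multiple random seeds, and every K-shot prompt contains all examples from the same seed's smaller-K prompts. A minimal sketch of that nested construction follows; the function name and return layout are hypothetical, chosen only to illustrate the nesting property.

```python
import random

def build_kshot_prompts(examples, shot_counts, num_seeds=3):
    """Sketch of nested K-shot prompt construction with multiple seeds.

    For each seed, one shuffled pool is drawn, and every K-shot prompt is
    simply the first K entries of that pool, so larger-K prompts contain
    all examples from smaller-K prompts (as the paper describes).
    Returns {seed: {K: list_of_examples}}.
    """
    prompts = {}
    for seed in range(num_seeds):
        rng = random.Random(seed)
        # One random ordering per seed, long enough for the largest K.
        pool = rng.sample(examples, k=max(shot_counts))
        prompts[seed] = {k: pool[:k] for k in sorted(shot_counts)}
    return prompts
```

Results would then be averaged over the `num_seeds` variants of each K-shot prompt, matching the report's quoted protocol.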