Many-Shot In-Context Learning

Authors: Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We systematically evaluate ICL performance at different scales of in-context examples for a wide range of tasks with Gemini 1.5 Pro. |
| Researcher Affiliation | Industry | Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle (Google DeepMind) |
| Pseudocode | Yes | Listing 1: Code for Generating Synthetic datasets for Linear Classification in High Dimensions. |
| Open Source Code | No | Unfortunately, our experiments depend on internal infrastructure and code that cannot be made fully public. |
| Open Datasets | Yes | We investigate how scaling the number of shots affects ICL performance on a wide variety of tasks ( 2): problem solving using MATH [22] and GSM8K [10], question-answering [GPQA, 51], summarization using XSum [42] and XLSum [19], algorithmic reasoning [BBH, 55], reward modeling [Code verification, A.5], low-resource machine translation [FLORES, 17], planning [Logistics, 53], and sentiment analysis [FP, 39]. |
| Dataset Splits | Yes | For reliable results, we randomly sample in-context examples for each K-shot prompt multiple times using different random seeds and report average performance, along with some visualization for performance on individual seeds. |
| Hardware Specification | No | It is typically not possible to infer details about compute resources, as the experiments depend on the Gemini 1.5 Pro API. |
| Software Dependencies | Yes | We use KV caching [48]... We use the default implementation of k-nearest neighbours (with k = 5) from scikit-learn [47]. |
| Experiment Setup | Yes | For reliable results, we randomly sample in-context examples for each K-shot prompt multiple times using different random seeds and report average performance, along with some visualization for performance on individual seeds. To ensure that using more shots provides additional information, any K-shot prompt in our setup includes all in-context examples from prompts with less than K examples... The best values for both models (fastest learning) were max_lr=1e-4, warmup_steps=1000. |
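The Pseudocode row refers to the paper's Listing 1, which generates synthetic datasets for linear classification in high dimensions. The listing itself is not reproduced in this report, so the following is only a minimal sketch of that kind of generator; the function name and the choice of standard-Gaussian inputs with a random linear separator are assumptions, not the paper's exact code.

```python
import numpy as np

def generate_linear_classification_data(n_points, dim, seed=0):
    """Sketch of a synthetic high-dimensional linear-classification generator.

    Assumption: inputs are i.i.d. standard Gaussian and labels come from the
    sign of a projection onto a random ground-truth hyperplane.
    """
    rng = np.random.default_rng(seed)
    # Random ground-truth separating hyperplane through the origin.
    w = rng.normal(size=dim)
    # Inputs drawn i.i.d. from a standard Gaussian in R^dim.
    x = rng.normal(size=(n_points, dim))
    # Binary labels from which side of the hyperplane each point falls on.
    y = (x @ w > 0).astype(int)
    return x, y

# Example: a small dataset with 32 points in 16 dimensions.
x, y = generate_linear_classification_data(n_points=32, dim=16)
```

Fixing the seed makes each dataset reproducible, which matters for the multi-seed averaging protocol the report quotes.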
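The Experiment Setup row quotes two protocol details: prompts are resampled under multiple random seeds, and every K-shot prompt contains all examples from the same seed's smaller-K prompts. A minimal sketch of that nested construction follows; the function name and return layout are hypothetical, chosen only to illustrate the nesting property.

```python
import random

def build_kshot_prompts(examples, shot_counts, num_seeds=3):
    """Sketch of nested K-shot prompt construction with multiple seeds.

    For each seed, one shuffled pool is drawn, and every K-shot prompt is
    simply the first K entries of that pool, so larger-K prompts contain
    all examples from smaller-K prompts (as the paper describes).
    Returns {seed: {K: list_of_examples}}.
    """
    prompts = {}
    for seed in range(num_seeds):
        rng = random.Random(seed)
        # One random ordering per seed, long enough for the largest K.
        pool = rng.sample(examples, k=max(shot_counts))
        prompts[seed] = {k: pool[:k] for k in sorted(shot_counts)}
    return prompts
```

Results would then be averaged over the `num_seeds` variants of each K-shot prompt, matching the report's quoted protocol.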