Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning
Authors: Zihan Chen, Song Wang, Zhen Tan, Jundong Li, Cong Shen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data. Our code is provided at https://github.com/Chen-1031/MAPLE_ICL. ... We conduct extensive experiments on various real-world datasets, and the results validate the effectiveness of our framework. Our main contributions are summarized as follows: ... Algorithm. We propose an influence-based mechanism to select and pseudo-label only the most impactful unlabeled samples and adaptively select demonstrations for each test query, ensuring strong performance without extensive pseudo-labeling. Practicality. Our approach significantly reduces the need for labeled data in many-shot ICL, thereby improving the feasibility of LLMs in real-world scenarios where labels are scarce. Through extensive experiments on diverse datasets, we demonstrate the superior performance of our framework over other baselines. |
| Researcher Affiliation | Academia | 1University of Virginia, Charlottesville, VA, USA 2Arizona State University, Tempe, AZ, USA. Correspondence to: Cong Shen <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations in Section 3 and its subsections. It does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. Appendix A provides proofs for theorems but not pseudocode for the overall framework. |
| Open Source Code | Yes | Our code is provided at https://github.com/Chen-1031/MAPLE_ICL. |
| Open Datasets | Yes | Datasets. We evaluate the effectiveness of our approach on eight datasets across four tasks. (1) Summarization: ... XSum dataset (Narayan et al., 2018) ... (2) Reasoning: ... Date, Salient, and Tracking7) from the Big Bench Hard (BBH) (Suzgun et al., 2023) ... (3) Classification: ... Financial Phrase Bank (FP) sentiment analysis (Malo et al., 2014; Wei & Liu, 2025) and a subset of challenging benchmark datasets (Li et al., 2024) that are specifically designed for ICL tasks with diverse classes and long inputs, including Banking77 and Go Emotion. (4) Question Answering: ... Google-Proof QA (GPQA) dataset (Rein et al., 2023)... |
| Dataset Splits | Yes | In our main experiment, we set k = 20 and α = 0.75, and we sample 1,000 demonstrations for labeling and 300 for testing. For datasets with fewer than 1,000 training samples or fewer than 300 test samples, we use the entire dataset. We randomly select 20 demonstrations to form DL. |
| Hardware Specification | No | The paper mentions using specific LLMs like 'Gemini 1.5 Flash' and 'Gemini 1.5 Pro' for evaluation, but it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) on which these LLMs were run or the experiments were conducted. The analysis in Appendix D on KV Cache describes FLOPs for transformers in general but not the specific hardware used for the authors' experiments. |
| Software Dependencies | No | The paper mentions specific LLMs like 'Gemini 1.5 Flash', 'Gemini 1.5 Pro', and encoder models such as 'Contriever (Izacard et al., 2021)', 'Sentence BERT (SBert) (Reimers & Gurevych, 2019)', and 'DeBERTa (He et al., 2020)'. However, it does not provide specific version numbers for these software components or any other libraries/frameworks (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication. |
| Experiment Setup | Yes | In our main experiment, we set k = 20 and α = 0.75, and we sample 1,000 demonstrations for labeling and 300 for testing. For datasets with fewer than 1,000 training samples or fewer than 300 test samples, we use the entire dataset. We randomly select 20 demonstrations to form DL. Unless specified otherwise, we evaluate the many-shot ICL performance of the Gemini 1.5 Flash (Team et al., 2024) model with 1M token context length. We apply the Contriever (Izacard et al., 2021) as fθ(·). We conduct experiments with pseudo-labeled sizes ranging from 20 to 100. For most tasks, while increasing the number of demonstrations further improves performance, many-shot ICL reaches a sufficiently good performance with around 100 to 27 demonstrations (Agarwal et al., 2024). The prompts used to elicit responses from ICL are provided in Appendix B. Each experiment is run five times, and the average performance is reported. |
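For readers reproducing the reported split procedure, the setup quoted above (sample up to 1,000 demonstrations for labeling and up to 300 for testing, then randomly draw k = 20 labeled demonstrations to form DL) can be sketched as below. This is a minimal illustrative sketch, not the authors' code; the function name `build_icl_splits` and its parameters are our own, and details such as seeding and where the test pool is drawn from are assumptions.

```python
import random

def build_icl_splits(dataset, n_label_pool=1000, n_test=300, k=20, seed=0):
    """Illustrative sketch of the reported setup: sample up to
    n_label_pool candidates for (pseudo-)labeling and up to n_test
    for evaluation, then randomly pick k labeled demonstrations (D_L).
    Datasets smaller than the requested sizes are used in full."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    # Pool of demonstrations available for labeling / pseudo-labeling.
    label_pool = data[:min(n_label_pool, len(data))]
    # Held-out test queries drawn from the remaining samples (assumption:
    # the paper may instead draw these from a predefined test split).
    rest = data[len(label_pool):]
    test_set = rest[:min(n_test, len(rest))]
    # Randomly select k demonstrations to form the labeled set D_L.
    labeled_demos = rng.sample(label_pool, min(k, len(label_pool)))
    return labeled_demos, label_pool, test_set

# Usage: with 2,000 samples, this yields |D_L| = 20, a labeling pool
# of 1,000, and 300 test queries; each of the five reported runs would
# use a different seed.
demos, pool, test = build_icl_splits(range(2000), seed=42)
```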