Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Why Larger Language Models Do In-context Learning Differently?
Authors: Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Preliminary experimental results on large base and chat models provide positive support for our analysis. |
| Researcher Affiliation | Academia | ¹University of Wisconsin-Madison, ²The University of Hong Kong. |
| Pseudocode | No | The paper contains mathematical derivations and proofs but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for its methodology or a link to a code repository. |
| Open Datasets | Yes | We conduct experiments on five prevalent NLP tasks, leveraging datasets from GLUE (Wang et al., 2018) tasks and Subj (Conneau & Kiela, 2018). |
| Dataset Splits | No | The paper mentions using "M = 16 in-context exemplars" but does not provide specific training, validation, and test dataset splits for the datasets used (GLUE, Subj). |
| Hardware Specification | No | The paper does not provide any specific hardware details such as CPU or GPU models, or memory specifications, used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We follow the prior work on in-context learning (Wei et al., 2023b) and use M = 16 in-context exemplars. ... Accuracy is calculated over 1000 evaluation prompts per dataset and over 5 runs with different random seeds for each evaluation... we introduce noise by inverting an escalating percentage of in-context example labels. |
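
The label-noise procedure quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code (the paper provides none); it assumes binary labels encoded as 0/1, and the function names `flip_labels` and `build_prompt`, the prompt template, and the placeholder sentences are all hypothetical. Only M = 16 and the idea of inverting a percentage of in-context labels come from the paper.

```python
import random

def flip_labels(exemplars, noise_fraction, rng):
    """Return a copy of exemplars with a fraction of binary labels inverted.

    Assumption: labels are 0/1, so inversion is `1 - label`.
    """
    n_flip = round(noise_fraction * len(exemplars))
    flip_idx = set(rng.sample(range(len(exemplars)), n_flip))
    return [
        (text, 1 - label) if i in flip_idx else (text, label)
        for i, (text, label) in enumerate(exemplars)
    ]

def build_prompt(exemplars, query):
    """Concatenate exemplars and the query (hypothetical template)."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in exemplars]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)

M = 16                      # number of in-context exemplars, as in the paper
rng = random.Random(0)      # one of several seeds; the paper reports 5 runs
exemplars = [(f"sentence {i}", i % 2) for i in range(M)]  # placeholder data

noisy = flip_labels(exemplars, 0.25, rng)   # invert 25% of the labels
n_changed = sum(a[1] != b[1] for a, b in zip(exemplars, noisy))
prompt = build_prompt(noisy, "a held-out sentence")
```

In an evaluation loop along these lines, one would sweep `noise_fraction` upward, build 1000 prompts per dataset, and average accuracy over the seeds; the model call itself is omitted because the page does not describe the inference stack.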