Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Technical Debt in In-Context Learning: Diminishing Efficiency in Long Context

Authors: Taejong Joo, Diego Klabjan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To quantify optimality of ICL as a learning algorithm, we compare ICL's sample complexity-related measures to those of principled learning algorithms by revisiting the performance profiles [20], a classic benchmarking framework for optimization software. As a result, we uncover a new insight on optimality of ICL in §3: While ICL with few-shot demonstrations achieves near optimal sample complexity, ICL's sample complexity sharply deteriorates as the number of demonstrations increases in long context. Concretely, many-shot ICL often requires 1.5 times more demonstrations than the Bayes optimal estimator to achieve the same performance. This indicates that, although transformers are theoretically capable of implementing principled algorithms in-context [19], their in-context learning behavior deviates significantly from the optimal learning algorithm in the many-shot regime. We further present evidence that, unlike principled algorithms, ICL may lack fundamental statistical properties (e.g., consistency and asymptotic efficiency) that are critical for algorithms to effectively learn from large demonstration sizes.
Researcher Affiliation	Academia	Taejong Joo & Diego Klabjan Department of Industrial Engineering & Management Sciences Northwestern University Evanston, IL, USA EMAIL
Pseudocode	No	The paper describes methodologies, objectives, and theoretical analyses using mathematical equations and text, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our source code is available at https://github.com/tjoo512/technical-debt-in-icl.
Open Datasets	No	For the data generating distribution of a prompt H_T, we follow the approach of sampling target functions f from a hierarchical distribution [21] to capture a more interesting aspect of a learning algorithm: model selection.
Dataset Splits	No	The paper describes a synthetic data generation process for each prompt/task and defines training and test context lengths (T_train and T). However, it does not specify traditional training, validation, and test dataset splits from a pre-existing static dataset, as the data is generated on the fly per task instance.
Hardware Specification	Yes	In this work, we use multiple servers which consist of multiple GPUs including RTX 8000 (50GB) and A100 (40GB).
Software Dependencies	No	The paper mentions using the GPT-2 architecture and the Adam optimizer but does not specify version numbers for general software dependencies like programming languages, frameworks (e.g., PyTorch, TensorFlow), or other libraries.
Experiment Setup	Yes	For the model, we use the GPT-2 [22] architecture for TF_θ, which is a standard architecture in the meta ICL and other stylized experimental settings; that is, we define TF_θ as a decoder-only transformer [49] with 12 layers, 8 attention heads, and 256-dimensional embedding space. For minimizing the ICL objective l(θ), we compute the stochastic gradient with 64 prompts and update θ by using the Adam optimizer [52] with a fixed learning rate of 10^-4 for one million training iterations. Also, in order to boost the convergence speed, we use curriculum learning [53] as recommended in [16, 21] by increasing the length of the prompt by 2 every 2,000 training iterations until it reaches (2M + 1) (and the order of Fourier series by 1 until it reaches M).
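
The curriculum quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustration, not the authors' code (see the repository linked under Open Source Code for that). The step sizes, the interval, the caps (2M + 1 and M), and the training constants come from the quoted text. The starting prompt length and starting Fourier order are not stated there, so `start_length` and `start_order` are assumptions, as are all names in the block.

```python
from __future__ import annotations

from dataclasses import dataclass

# Constants quoted in the Experiment Setup row.
N_LAYERS = 12
N_HEADS = 8
EMBED_DIM = 256
BATCH_PROMPTS = 64
LEARNING_RATE = 1e-4
TRAIN_ITERATIONS = 1_000_000


@dataclass(frozen=True)
class CurriculumConfig:
    """Hypothetical container for the curriculum described in the setup."""

    max_order: int           # M, the final order of the Fourier series
    start_length: int = 3    # ASSUMPTION: initial prompt length (not stated)
    start_order: int = 1     # ASSUMPTION: initial Fourier order (not stated)
    length_step: int = 2     # prompt length grows by 2 per stage
    order_step: int = 1      # Fourier order grows by 1 per stage
    interval: int = 2_000    # one stage every 2,000 training iterations

    @property
    def max_length(self) -> int:
        # The prompt length is capped at 2M + 1.
        return 2 * self.max_order + 1


def curriculum_at(step: int, cfg: CurriculumConfig) -> tuple[int, int]:
    """Return (prompt_length, fourier_order) used at a training iteration."""
    if step < 0:
        raise ValueError("step must be non-negative")
    stages = step // cfg.interval
    length = min(cfg.start_length + cfg.length_step * stages, cfg.max_length)
    order = min(cfg.start_order + cfg.order_step * stages, cfg.max_order)
    return length, order


if __name__ == "__main__":
    cfg = CurriculumConfig(max_order=10)
    for step in (0, 2_000, 10_000, 18_000, TRAIN_ITERATIONS - 1):
        print(step, curriculum_at(step, cfg))
```

Under the assumed starting values, the prompt length stays equal to twice the Fourier order plus one at every stage, so both quantities reach their caps (2M + 1 and M) at the same iteration. Different starting values would break that alignment without changing the rest of the schedule.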