Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Interpretable Next-token Prediction via the Generalized Induction Head

Authors: Eunji Kim, Sriya Mantena, Weiwei Yang, Chandan Singh, Sungroh Yoon, Jianfeng Gao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate GIM in two settings: language modeling and fMRI response prediction. In language modeling, GIM improves next-token prediction by up to 25%p over interpretable baselines, significantly narrowing the gap with black-box LLMs. In an fMRI setting, GIM improves neural response prediction by 20% and offers insight into the language selectivity of the brain.
Researcher Affiliation	Collaboration	1 Microsoft Research 2 Department of Electrical and Computer Engineering, Seoul National University 3 Stanford University 4 Interdisciplinary Program in Artificial Intelligence, Seoul National University
Pseudocode	No	The paper describes methods using natural language, equations, and figures illustrating the pipeline, but it does not contain explicit 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	The code is available at https://github.com/ejkim47/generalized-induction-head.
Open Datasets	Yes	We use 4 text datasets for evaluation: BabyLM [68], OpenWebText [17], Pile [69], and FineWeb ([70]; sample-10BT subset), using some as the reference corpus and some as test datasets (Table 1). [...] We analyzed publicly available data (footnote 6) from [72] and [73], in which three human participants listened to 20+ hours of English-language podcast narratives while their fMRI responses were recorded across 95,556 cortical voxels. Footnote 6: https://github.com/OpenNeuroDatasets/ds003020
Dataset Splits	Yes	When testing, we report performance on 100k sequences randomly sampled with a context length of 1024 and a stride of 512 [14, 27]. [...] To identify the optimal value of τ, we conducted cross-validation using the BabyLM training set (100M tokens). [...] fit linear models to map these embeddings to fMRI responses on the training split (24 stories), and evaluated performance on the test split (2 stories) using bootstrapped ridge regression.
Hardware Specification	Yes	Training is conducted on four NVIDIA A100 GPUs. [...] we conduct experiments in two environments: one with a single NVIDIA A40 GPU and 128 CPU cores, and another with two NVIDIA H100 GPUs and 64 CPU cores.
Software Dependencies	No	The paper mentions specific tools like 'FSL 5.0' for preprocessing and components like 'GPT-2 tokenizer', 'LLaMA-2 tokenizer', and 'AdamW optimizer [95]', but it does not provide specific version numbers for the core programming languages or machine learning frameworks (e.g., Python, PyTorch/TensorFlow) used for the main implementation.
Experiment Setup	Yes	we set τ to 8 and 9 for the GPT-2 and LLaMA-2 tokenizers, respectively, based on cross-validation results (see Appendix A.3 for details). [...] During inference, the maximum length for exact matching with Infini-gram is 500, and we use window size k for fuzzy matching as 32 and 64 for GPT-2 and LLaMA-2 tokenizers, respectively. [...] The Fuzzy Matching Model is trained with a combination of Cross Entropy (CE) loss and reverse Kullback-Leibler divergence (KLD) loss (Fig. 2(a)). In each training batch, we generate similarity pairs from randomly sampled sequences. The CE loss aids in identifying the most similar pairs. The reverse KLD loss guides the model to follow the overall similarity distribution, ensuring that close pairs receive high scores while distant pairs receive low scores. [...] Gradients are accumulated over 16 iterations, and we use the AdamW optimizer [95] with a learning rate of 0.0001 and a weight decay of 0.1. The learning rate follows a cosine schedule with a warmup over the first 1,000 iterations, and training continues for 15,000 or 20,000 iterations.
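
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a small sketch. This is not the authors' code: the excerpt gives only the peak learning rate (0.0001), the warmup length (1,000 iterations), and the training length (15,000 or 20,000 iterations). The linear warmup shape, the decay floor of zero, the function name `lr_at`, and the treatment of one "iteration" as one scheduler step (independent of the 16-step gradient accumulation) are all assumptions made here for illustration.

```python
import math

BASE_LR = 1e-4        # peak learning rate quoted in the table
WARMUP_ITERS = 1_000  # warmup length quoted in the table
TOTAL_ITERS = 15_000  # one of the two quoted training lengths (15,000 or 20,000)


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_ITERS, total=TOTAL_ITERS, min_lr=0.0):
    """Return the learning rate at a 0-indexed scheduler step.

    Assumptions (not stated in the source): the warmup is linear and the
    cosine phase decays to min_lr = 0 at the final iteration.
    """
    if step < warmup:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    # Half-cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 999, 1_000, 8_000, 15_000):
        print(s, lr_at(s))
```

Under these assumptions the rate peaks at the end of warmup and falls to half the peak midway through the cosine phase. For the 20,000-iteration runs, pass `total=20_000`.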