Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

On Understanding Attention-Based In-Context Learning for Categorical Data

Authors: Aaron T Wang, William Convertino, Xiang Cheng, Ricardo Henao, Lawrence Carin

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the framework empirically on synthetic data, image classification and language generation. 2. We empirically validate our framework through experiments on diverse datasets: (a) We tackle in-context image classification on ImageNet (Russakovsky et al., 2014)... (b) We apply our GD-based model to language generation, training on a combined corpus of Tiny Stories and Children Stories (Eldan & Li, 2023)...
Researcher Affiliation	Academia	1Electrical & Computer Engineering Dept., Duke University, Durham, NC, USA. Correspondence to: Lawrence Carin <EMAIL>.
Pseudocode	No	No explicit pseudocode or algorithm blocks are provided, but the model architecture and steps are described in prose and diagrams in Section 3 and Figures 1 and 2.
Open Source Code	Yes	Code needed to replicate our experiments is at https://github.com/aarontwang/icl_attention_categorical.
Open Datasets	Yes	We tackle in-context image classification on ImageNet (Russakovsky et al., 2014)... We apply our GD-based model to language generation, training on a combined corpus of Tiny Stories and Children Stories (Eldan & Li, 2023)... Footnote 1: https://huggingface.co/datasets/ajibawa-2023/Children-Stories-Collection
Dataset Splits	Yes	For each contextual set C(l), 5 distinct classes are selected uniformly at random, and for each such class 10 specific images are selected at random, and therefore N = 50 (image N + 1 is selected at random from the 5 class types considered in the context data). When training L = 2048, and test performance is averaged for M = 2048.
Hardware Specification	Yes	All experiments were performed on a Tesla V100 PCIe 16 GB GPU.
Software Dependencies	No	The paper does not provide specific version numbers for software dependencies such as libraries or programming languages used for their implementation. It only mentions the use of "GPT-4o model" for evaluation.
Experiment Setup	Yes	embedding vectors are learned for each token, with C = 50,257 unique tokens represented and an embedding dimension d = 512; 8 attention heads are used for both models. Additionally, positional embedding vectors are learned for each of the 256 positions in our model's context window, with an additional 257th position learned for the GD model (for position x_{N+1}).
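
The contextual-set construction quoted in the Dataset Splits row (5 classes drawn uniformly at random, 10 images per class, N = 50, plus one query image N + 1 from one of those 5 classes) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name sample_context_episode, the images_by_class mapping, the shuffling of the context, and the choice to draw the query from images not already in the context are all assumptions made for illustration.

```python
import random


def sample_context_episode(images_by_class, n_classes=5, n_per_class=10, rng=None):
    """Sample one contextual set plus a query item.

    images_by_class is a hypothetical dict mapping a class label to a list of
    image identifiers; each class needs more than n_per_class images so that a
    held-out query image is available.
    """
    rng = rng or random.Random()

    # Select the classes for this contextual set uniformly at random.
    classes = rng.sample(sorted(images_by_class), n_classes)

    # Select n_per_class images at random from each chosen class.
    context = []
    for label in classes:
        for image_id in rng.sample(images_by_class[label], n_per_class):
            context.append((image_id, label))

    # Assumption: the order of context items is randomized.
    rng.shuffle(context)

    # Query item N + 1: one of the selected classes, chosen at random.
    query_label = rng.choice(classes)

    # Assumption: the query image is not one of the context images.
    used = {image_id for image_id, label in context if label == query_label}
    candidates = [i for i in images_by_class[query_label] if i not in used]
    query_image = rng.choice(candidates)

    return context, (query_image, query_label)
```

With the default arguments the context holds 5 x 10 = 50 labeled items, matching N = 50 in the quoted description. Repeating the call L = 2048 times would give the number of training contextual sets quoted above, although how the authors batch or reuse those sets is not stated in this entry.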