Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

Authors: Yukang Yang, Declan Iain Campbell, Kaixuan Huang, Mengdi Wang, Jonathan D. Cohen, Taylor Whittington Webb

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To identify and validate the presence of these mechanisms, we perform experiments across three abstract reasoning tasks (algebraic rule induction, letter string analogies, and verbal analogies) and 13 open-source LLMs from four model families (GPT-2 (Radford et al., 2019), Gemma-2 (Gemma Team, 2024), Qwen2.5 (Qwen Team, 2025), and Llama-3.1 (Dubey et al., 2024)), drawing on convergent evidence from a series of causal, representational, and attention analyses. We find robust evidence for these mechanisms across all three tasks and in three out of four model families (Gemma-2, Qwen2.5, and Llama-3.1), with more equivocal results for GPT-2.
Researcher Affiliation Collaboration 1) Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ; 2) Princeton Neuroscience Institute, Princeton University, Princeton, NJ; 3) Microsoft Research, New York, NY.
Pseudocode Yes Algorithm 1 Causal Mediation Analysis
Open Source Code Yes All code and data will be released at https://github.com/yukang123/LLMSymbMech.
Open Datasets Yes All code and data will be released at https://github.com/yukang123/LLMSymbMech.
Dataset Splits Yes We created 500 prompts for each rule (ABA and ABB), each involving completely disjoint token sets. We split these into a training set of 200 prompts, a validation set of 100 prompts, and a test set of 200 prompts.
Hardware Specification Yes Experiments on Llama-3.1 70B and Qwen2.5 72B were conducted on two NVIDIA H100 80GB GPUs, while experiments on the other, smaller models were conducted on a single H100 GPU.
Software Dependencies No All code was written in Python using the TransformerLens and Hugging Face libraries. No specific version numbers for Python or the libraries are provided.
Experiment Setup Yes To evaluate performance on the rule induction task, we randomly selected English tokens from the Llama-3 vocabulary to form 2,000 prompts. We used the following 2-shot prompt format: A1^B1^A1\n A2^B2^A2\n A3^B3^ ... Permutation testing was performed to estimate the family-wise error rate across all scores in each plot, and scores were thresholded so that only scores significantly greater than zero (p < 0.05) are shown. ... We trained a one-layer linear probe to decode abstract variables (A vs. B) based on the outputs of symbol abstraction and symbolic induction heads.
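The one-layer linear probe described in the experiment setup (decoding the abstract variable A vs. B from head outputs) can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic "head output" vectors, their dimensionality, the learning rate, and the 40/20/40 split proportions (mirroring the paper's 200/100/200 prompts) are all assumptions made for the sketch.

```python
import numpy as np

# Hedged sketch: a one-layer linear probe (logistic regression) decoding an
# abstract variable (0 = A, 1 = B) from head-output vectors. Synthetic data
# stands in for real symbol abstraction / symbolic induction head outputs.
rng = np.random.default_rng(0)
d = 32                      # hypothetical head-output dimensionality
n_per_class = 250           # illustrative sample count per variable

# Two classes separated along a random direction in activation space.
direction = rng.normal(size=d)
X = np.vstack([rng.normal(size=(n_per_class, d)) + direction,
               rng.normal(size=(n_per_class, d)) - direction])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split 40% train / 20% val / 40% test (mirroring 200/100/200).
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
n_train, n_val = int(0.4 * len(y)), int(0.2 * len(y))
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train + n_val:], y[n_train + n_val:]

# Train the probe by plain gradient descent on the logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X_tr @ w + b, -30, 30)      # avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X_tr.T @ (p - y_tr)) / len(y_tr)
    b -= 0.5 * np.mean(p - y_tr)

test_acc = np.mean(((X_te @ w + b) > 0) == y_te)
print(f"probe test accuracy: {test_acc:.2f}")
```

Because the probe is linear with no hidden layer, above-chance accuracy indicates the abstract variable is linearly decodable from the head outputs, which is the property the paper's probing analysis tests for.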
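The experiment setup also mentions permutation testing to control the family-wise error rate across all scores in a plot. One standard way to do this is the max-statistic method, sketched below under assumed shapes; the synthetic scores, the number of permutations, and the Gaussian null draws are placeholders for the paper's actual head scores and label permutations.

```python
import numpy as np

# Hedged sketch of family-wise error control via the max-statistic
# permutation method: build a null distribution of the MAXIMUM score
# across the whole family of tests, then keep only observed scores
# above its 95th percentile (family-wise p < 0.05).
rng = np.random.default_rng(1)
n_scores, n_perms = 20, 1000

observed = rng.normal(0.0, 1.0, size=n_scores)
observed[:3] += 6.0          # three genuinely large scores (illustrative)

# Each "permutation" here is a fresh null draw; in practice the scores
# would be recomputed after shuffling labels or attention targets.
null_max = np.array([rng.normal(0.0, 1.0, size=n_scores).max()
                     for _ in range(n_perms)])
threshold = np.quantile(null_max, 0.95)

significant = observed > threshold
print(f"{significant.sum()} of {n_scores} scores pass the FWER threshold")
```

Taking the maximum over the family at each permutation is what makes the resulting threshold control the family-wise rate, rather than the per-test rate, matching the "significantly greater than zero (p < 0.05)" thresholding described above.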