Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

Authors: Yukang Yang, Declan Iain Campbell, Kaixuan Huang, Mengdi Wang, Jonathan D. Cohen, Taylor Whittington Webb

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To identify and validate the presence of these mechanisms, we perform experiments across three abstract reasoning tasks (algebraic rule induction, letter string analogies, and verbal analogies) and 13 open-source LLMs from four model families (GPT-2 (Radford et al., 2019), Gemma-2 (Gemma Team, 2024), Qwen2.5 (Qwen Team, 2025), and Llama-3.1 (Dubey et al., 2024)), drawing on convergent evidence from a series of causal, representational, and attention analyses. We find robust evidence for these mechanisms across all three tasks and in three out of four model families (Gemma-2, Qwen2.5, and Llama-3.1), with more equivocal results for GPT-2.
Researcher Affiliation Collaboration 1) Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ; 2) Princeton Neuroscience Institute, Princeton University, Princeton, NJ; 3) Microsoft Research, New York, NY.
Pseudocode Yes Algorithm 1 Causal Mediation Analysis
Open Source Code Yes All code and data will be released at https://github.com/yukang123/LLMSymbMech.
Open Datasets Yes All code and data will be released at https://github.com/yukang123/LLMSymbMech.
Dataset Splits Yes We created 500 prompts for each rule (ABA and ABB), each involving completely disjoint token sets. We split these into a training set of 200 prompts, a validation set of 100 prompts, and a test set of 200 prompts.
Hardware Specification Yes Experiments on Llama-3.1 70B and Qwen2.5 72B were conducted on two NVIDIA H100 80GB GPUs, while experiments on the other, smaller models were conducted on a single H100 GPU.
Software Dependencies No All code was written in Python using the TransformerLens and Hugging Face libraries. No specific version numbers for Python or the libraries are provided.
Experiment Setup Yes To evaluate performance on the rule induction task, we randomly selected English tokens from the Llama-3 vocabulary to form 2,000 prompts. We used the following 2-shot prompt format: A1^B1^A1\n A2^B2^A2\n A3^B3^ ... Permutation testing was performed to estimate the family-wise error rate across all scores in each plot, and scores were thresholded so that only scores significantly greater than zero (p < 0.05) are shown. ... We trained a one-layer linear probe to decode abstract variables (A vs. B) based on the outputs of symbol abstraction and symbolic induction heads.
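The one-layer linear probe described in the experiment setup (decoding the abstract variable A vs. B from head outputs) can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic "head output" vectors, their dimensionality, the learning rate, and the 40/20/40 split proportions (mirroring the paper's 200/100/200 prompts) are all assumptions made for the sketch.

```python
import numpy as np

# Hedged sketch: a one-layer linear probe (logistic regression) decoding an
# abstract variable (0 = A, 1 = B) from head-output vectors. Synthetic data
# stands in for real symbol abstraction / symbolic induction head outputs.
rng = np.random.default_rng(0)
d = 32                      # hypothetical head-output dimensionality
n_per_class = 250           # illustrative sample count per variable

# Two classes separated along a random direction in activation space.
direction = rng.normal(size=d)
X = np.vstack([rng.normal(size=(n_per_class, d)) + direction,
               rng.normal(size=(n_per_class, d)) - direction])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split 40% train / 20% val / 40% test (mirroring 200/100/200).
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
n_train, n_val = int(0.4 * len(y)), int(0.2 * len(y))
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train + n_val:], y[n_train + n_val:]

# Train the probe by plain gradient descent on the logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X_tr @ w + b, -30, 30)      # avoid overflow in exp
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X_tr.T @ (p - y_tr)) / len(y_tr)
    b -= 0.5 * np.mean(p - y_tr)

test_acc = np.mean(((X_te @ w + b) > 0) == y_te)
print(f"probe test accuracy: {test_acc:.2f}")
```

Because the probe is linear with no hidden layer, above-chance accuracy indicates the abstract variable is linearly decodable from the head outputs, which is the property the paper's probing analysis tests for.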
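The experiment setup also mentions permutation testing to control the family-wise error rate across all scores in a plot. One standard way to do this is the max-statistic method, sketched below under assumed shapes; the synthetic scores, the number of permutations, and the Gaussian null draws are placeholders for the paper's actual head scores and label permutations.

```python
import numpy as np

# Hedged sketch of family-wise error control via the max-statistic
# permutation method: build a null distribution of the MAXIMUM score
# across the whole family of tests, then keep only observed scores
# above its 95th percentile (family-wise p < 0.05).
rng = np.random.default_rng(1)
n_scores, n_perms = 20, 1000

observed = rng.normal(0.0, 1.0, size=n_scores)
observed[:3] += 6.0          # three genuinely large scores (illustrative)

# Each "permutation" here is a fresh null draw; in practice the scores
# would be recomputed after shuffling labels or attention targets.
null_max = np.array([rng.normal(0.0, 1.0, size=n_scores).max()
                     for _ in range(n_perms)])
threshold = np.quantile(null_max, 0.95)

significant = observed > threshold
print(f"{significant.sum()} of {n_scores} scores pass the FWER threshold")
```

Taking the maximum over the family at each permutation is what makes the resulting threshold control the family-wise rate, rather than the per-test rate, matching the "significantly greater than zero (p < 0.05)" thresholding described above.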