Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models
Authors: Yukang Yang, Declan Iain Campbell, Kaixuan Huang, Mengdi Wang, Jonathan D. Cohen, Taylor Whittington Webb
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To identify and validate the presence of these mechanisms, we perform experiments across three abstract reasoning tasks (algebraic rule induction, letter string analogies, and verbal analogies) and 13 open-source LLMs from four model families (GPT-2 (Radford et al., 2019), Gemma2 (Gemma Team, 2024), Qwen2.5 (Qwen Team, 2025), and Llama-3.1 (Dubey et al., 2024)), drawing on convergent evidence from a series of causal, representational, and attention analyses. We find robust evidence for these mechanisms across all three tasks, and three out of four model families (Gemma-2, Qwen2.5, and Llama-3.1; with more equivocal results for GPT-2). |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 2Princeton Neuroscience Institute, Princeton University, Princeton, NJ 3Microsoft Research, New York, NY. |
| Pseudocode | Yes | Algorithm 1 Causal Mediation Analysis |
| Open Source Code | Yes | All code and data will be released at https://github.com/yukang123/LLMSymbMech. |
| Open Datasets | Yes | All code and data will be released at https://github.com/yukang123/LLMSymbMech. |
| Dataset Splits | Yes | We created 500 prompts for each rule (ABA and ABB), each involving completely disjoint token sets. We split these into a training set of 200 prompts, a validation set of 100 prompts, and a test set of 200 prompts. |
| Hardware Specification | Yes | Experiments on Llama-3.1 70B and Qwen2.5 72B were conducted on two NVIDIA 80G H100 GPUs while experiments on other models of smaller sizes were conducted on a single H100 GPU. |
| Software Dependencies | No | All code was written in Python using the Transformer Lens and Hugging Face libraries. No specific version numbers for Python or the libraries are provided. |
| Experiment Setup | Yes | To evaluate performance on the rule induction task, we randomly selected English tokens from the Llama-3 vocabulary to form 2,000 prompts. We used the following 2-shot prompt format: A1^B1^A1\nA2^B2^A2\nA3^B3^ ... Permutation testing was performed to estimate the family-wise error rate across all scores in each plot, and scores were thresholded so that only scores significantly greater than zero (p < 0.05) are shown. ... We trained a one-layer linear probe to decode abstract variables (A vs. B) based on the outputs of symbol abstraction and symbolic induction heads. |
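The quoted setup trains a one-layer linear probe to decode the abstract variable (A vs. B) from attention-head outputs. The paper's actual probe operates on activations extracted from the models; the sketch below only illustrates the general technique (a single linear layer, here logistic regression, fit on labeled activation vectors) using synthetic stand-in activations, since the real head outputs and any exact probe hyperparameters are not given in the quoted text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for head outputs: in the paper these would be the
# outputs of symbol abstraction / symbolic induction heads. Here we draw
# Gaussian vectors with a class-dependent offset along a random direction.
rng = np.random.default_rng(0)
d_model = 64          # hypothetical activation dimensionality
n_per_class = 200
direction = rng.normal(size=d_model)
acts_A = rng.normal(size=(n_per_class, d_model)) + 1.5 * direction
acts_B = rng.normal(size=(n_per_class, d_model)) - 1.5 * direction

X = np.vstack([acts_A, acts_B])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = "A", 1 = "B"

# One-layer linear probe: a single linear map followed by a sigmoid,
# i.e. logistic regression fit on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = probe.score(X, y)
```

High probe accuracy on held-out activations is the usual evidence that the abstract variable is linearly decodable from the head's output; on real data one would evaluate on the disjoint test split described in the Dataset Splits row rather than on the training set.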