Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
The Emergence of Abstract Thought in Large Language Models Beyond Any Language
Authors: Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across diverse LLM families support our approach. |
| Researcher Affiliation | Collaboration | 1 National University of Singapore 2 Salesforce AI Research 3 Peking University 4 Singapore University of Technology and Design |
| Pseudocode | Yes | A Parallel Neuron Detection Algorithm |
| Open Source Code | Yes | Our codes are available at https://github.com/chenyuxin1999/Abstract_Thought. |
| Open Datasets | Yes | Multilingual Massive Multitask Language Understanding (MMMLU) dataset (OpenAI, 2024), a human-translated extension of the original MMLU benchmark (Hendrycks et al., 2021), available in 14 languages. In addition, we incorporate the Multilingual Grade School Math (MGSM) dataset (Shi et al., 2022), a translated version of GSM8K (Cobbe et al., 2021), which covers 10 languages. Together, these datasets provide quantitative measures of the models' multilingual capabilities. ... For each language, we identify language-related neurons by analyzing activation patterns on 1000 sentences sampled from the OSCAR corpus (Abadji et al., 2022). ... we construct a training set by sampling 100,000 examples per language from a mixture of three widely used multilingual datasets: CulturaX (Nguyen et al., 2024), MADLAD (Kudugunta et al., 2023), and Wikipedia (Guo et al., 2020). |
| Dataset Splits | Yes | For each language, we identify language-related neurons by analyzing activation patterns on 1000 sentences sampled from the OSCAR corpus (Abadji et al., 2022). ... Specifically, we construct a training set by sampling 100,000 examples per language from a mixture of three widely used multilingual datasets: CulturaX (Nguyen et al., 2024), MADLAD (Kudugunta et al., 2023), and Wikipedia (Guo et al., 2020). |
| Hardware Specification | Yes | LoRA requires longer training time (2.2 hours on 2 H200 GPUs) |
| Software Dependencies | No | The paper does not explicitly provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | To comprehensively evaluate the emergence of abstract thought in LLMs throughout their development, we examine 20 open-source models encompassing diverse model families, release periods, and sizes. ... We utilize Llama3.2-1B (Grattafiori et al., 2024), Llama3.2-3B, and Llama3.1-8B as representative LLMs with low, medium, and high language-agnostic scores, respectively. We conduct experiments under three training settings: language-shared neurons, language-exclusive neurons, and an equal number of randomly selected neurons. ... for every query in a given language, we rank neurons based on their computed importance scores and select the top 1% as activated neurons. ... We implement LoRA-based fine-tuning on Llama3.2-3B under a similar parameter budget (rank = 48). |
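
The Experiment Setup row describes ranking neurons by importance score and keeping the top 1% as activated neurons for each language. The selection step can be sketched as below. This is an illustrative sketch, not the authors' implementation: the function names `top_fraction_indices` and `split_shared_exclusive` are hypothetical, the importance scores are random placeholders, and the definitions used here (shared = activated in every language, exclusive = activated in exactly one language) are assumptions, since the quoted text does not define them.

```python
import heapq
import random


def top_fraction_indices(scores, fraction=0.01):
    """Return the set of indices of the highest-scoring neurons.

    `scores` is a list of per-neuron importance scores (hypothetical input);
    `fraction` = 0.01 mirrors the "top 1%" threshold quoted above.
    """
    k = max(1, int(round(len(scores) * fraction)))
    # nlargest ranks indices by their score and keeps the k best
    return set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


def split_shared_exclusive(activated_by_language):
    """Split activated neurons into shared and per-language exclusive sets.

    Assumed definitions (not taken from the source):
      shared    = activated in every language
      exclusive = activated in exactly one language
    """
    languages = list(activated_by_language)
    shared = set.intersection(*(activated_by_language[l] for l in languages))
    exclusive = {}
    for lang in languages:
        others = set().union(
            *(activated_by_language[o] for o in languages if o != lang)
        )
        exclusive[lang] = activated_by_language[lang] - others
    return shared, exclusive


# Placeholder scores for three languages over 10,000 hypothetical neurons
rng = random.Random(0)
activated = {
    lang: top_fraction_indices([rng.random() for _ in range(10_000)])
    for lang in ("en", "zh", "fr")
}
shared, exclusive = split_shared_exclusive(activated)
print(len(shared), {lang: len(ids) for lang, ids in exclusive.items()})
```

With real importance scores in place of the random placeholders, the resulting index sets could serve as the masks for the language-shared and language-exclusive training settings, with a random set of equal size as the control.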