Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

The Emergence of Abstract Thought in Large Language Models Beyond Any Language

Authors: Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across diverse LLM families support our approach.
Researcher Affiliation	Collaboration	1 National University of Singapore 2 Salesforce AI Research 3 Peking University 4 Singapore University of Technology and Design
Pseudocode	Yes	A Parallel Neuron Detection Algorithm
Open Source Code	Yes	Our codes are available at https://github.com/chenyuxin1999/Abstract_Thought.
Open Datasets	Yes	Multilingual Massive Multitask Language Understanding (MMMLU) dataset (OpenAI, 2024), a human-translated extension of the original MMLU benchmark (Hendrycks et al., 2021), available in 14 languages. In addition, we incorporate the Multilingual Grade School Math (MGSM) dataset (Shi et al., 2022), a translated version of GSM8K (Cobbe et al., 2021), which covers 10 languages. Together, these datasets provide quantitative measures of the models' multilingual capabilities. ... For each language, we identify language-related neurons by analyzing activation patterns on 1000 sentences sampled from the OSCAR corpus (Abadji et al., 2022). ... we construct a training set by sampling 100,000 examples per language from a mixture of three widely used multilingual datasets: CulturaX (Nguyen et al., 2024), MADLAD (Kudugunta et al., 2023), and Wikipedia (Guo et al., 2020).
Dataset Splits	Yes	For each language, we identify language-related neurons by analyzing activation patterns on 1000 sentences sampled from the OSCAR corpus (Abadji et al., 2022). ... Specifically, we construct a training set by sampling 100,000 examples per language from a mixture of three widely used multilingual datasets: CulturaX (Nguyen et al., 2024), MADLAD (Kudugunta et al., 2023), and Wikipedia (Guo et al., 2020).
Hardware Specification	Yes	LoRA requires longer training time (2.2 hours on 2 H200 GPUs)
Software Dependencies	No	The paper does not explicitly provide specific software dependencies with version numbers.
Experiment Setup	Yes	To comprehensively evaluate the emergence of abstract thought in LLMs throughout their development, we examine 20 open-source models encompassing diverse model families, release periods, and sizes. ... We utilize Llama3.2-1B (Grattafiori et al., 2024), Llama3.2-3B, and Llama3.1-8B as representative LLMs with low, medium, and high language-agnostic scores, respectively. We conduct experiments under three training settings: language-shared neurons, language-exclusive neurons, and an equal number of randomly selected neurons. ... for every query in a given language, we rank neurons based on their computed importance scores and select the top 1% as activated neurons. ... We implement LoRA-based fine-tuning on LLaMA3.2-3B under a similar parameter budget (rank = 48).
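
The Experiment Setup excerpt mentions ranking neurons by importance score and keeping the top 1% as activated neurons. The following is a minimal sketch of that selection step only, not the authors' implementation. It assumes the importance scores are already available as a flat list of numbers; how they are computed is not described in this entry. The function name `select_top_fraction` and its parameters are hypothetical, and rounding the neuron count down (with a minimum of one) is an assumption.

```python
def select_top_fraction(importance, fraction=0.01):
    """Return the indices of the highest-scoring neurons, in ascending index order.

    `importance` is a flat list of per-neuron scores (assumed precomputed).
    `fraction` is the share of neurons to keep; 0.01 matches the 1% in the excerpt.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not importance:
        return []
    # Assumption: round the count down, but always keep at least one neuron.
    k = max(1, int(len(importance) * fraction))
    # Rank neuron indices from highest to lowest score.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: four neurons, keep the top half.
print(select_top_fraction([0.5, 0.9, 0.1, 0.7], fraction=0.5))
```

In practice the scores would come from a model's activations for one query, and the selection would be repeated per query and per language.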