Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

Authors: Linyang He, Tianjun Zhong, Richard Antonello, Gavin Mischler, Micah Goldblum, Nima Mesgarani

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400 ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing.
Researcher Affiliation	Academia	(1) Zuckerman Mind Brain Behavior Institute, Columbia University; (2) Department of Electrical Engineering, Columbia University; (3) Department of Computer Science, Columbia University
Pseudocode	Yes	Algorithm 1: Construction of Feature-Specific Residual Embeddings. 1: Input: LLM hidden states {H_L}_{L=0}^{L_max} for each token; probing datasets D_s, D_m, D_r. 2: Output: Feature-specific embeddings E_l, E_s, E_m, E_r. 3: Perform probing with D_s, D_m, D_r to find saturation layers: 4: L_s ← syntax saturation layer from D_s. 5: L_m ← meaning saturation layer from D_m. 6: L_r ← reasoning saturation layer from D_r. 7: L_l ← 0. 8: Define lexical embedding: E_l ← H_{L_l}. 9: for each (L_low, L_high) in {(L_l, L_s), (L_s, L_m), (L_m, L_r)} do 10: Train ridge regression g to predict H_{L_high} from H_{L_low}. 11: Compute residual embedding: E ← H_{L_high} − g(H_{L_low}). 12: Assign E_s, E_m, E_r accordingly.
Open Source Code	No	Footnote 2: Code available here. (...) Justification: All datasets used in this study are publicly available. For linguistic probing, we use BLiMP, COMPS-BASE, COMPS-WUGS, and COMPS-WUGS-DIST [40, 33]. For brain encoding analysis, we use the Podcast ECoG dataset, which includes aligned transcripts for one episode, and an expanded set of podcast transcripts introduced in [43, 30]. Code will be organized and released on a public platform upon publication.
Open Datasets	Yes	This work uses several publicly available datasets and a language model, all of which are properly cited and used in accordance with their respective licenses. No proprietary or scraped data was used. BLiMP Dataset [40]: A suite of syntactic probing tasks for language models. URL: https://github.com/alexwarstadt/blimp License: CC-BY 4.0. COMPS Datasets [33]: COMPS-BASE and COMPS-WUGS-DIST are used to assess semantic and reasoning representations. URL: https://github.com/kanishkamisra/comps/ License: Apache License 2.0. The Podcast ECoG Dataset [43]: High-resolution electrocorticographic recordings from participants listening to a natural podcast. URL: https://openneuro.org/datasets/ds005574/versions/1.0.2 License: CC0. Expanded Podcast Transcripts [30]: Text for additional podcast episodes used to extend ECoG analysis. URL: https://github.com/calclavia/tal-asrd License: No explicit license is specified on the repository. However, the author provides access to the dataset and indicates that it can be downloaded for research purposes. We used the dataset solely for non-commercial academic research and did not redistribute or modify it.
Dataset Splits	Yes	with α selected via 5-fold cross-validation over a log-spaced grid, and b = 5 bootstrap resamples per fold using contiguous chunks of length l = 32.
Hardware Specification	Yes	Hidden State Extraction: Performed using 4 NVIDIA L40 GPUs. Extracting hidden states for around 164,000 tokens from the Qwen2.5-14B model. The extraction took approximately 40 minutes and required around 4 × 30 GB of GPU memory. Layer-wise Probing: Conducted using 1 NVIDIA L40 GPU. [...] Overall, the full pipeline can be reproduced on a modern workstation or cloud instance equipped with 1 NVIDIA L40 GPU + 30 GB RAM.
Software Dependencies	No	Residual Embedding Construction: Ridge regression training was done using Scikit-learn on CPU with 30 GB of memory. Each regression model took less than 10 minutes to converge. NVIDIA cuML and GPU training was not adopted due to lack of support for multi-output ridge regression training.
Experiment Setup	Yes	We fit: $W = \arg\min_W \|Y - XW\|_F^2 + \alpha \|W\|_F^2$, with α selected via 5-fold cross-validation over a log-spaced grid, and b = 5 bootstrap resamples per fold using contiguous chunks of length l = 32. Model performance is quantified by Pearson correlation between predicted and actual signals. [...] Null distribution and responsiveness criterion. Considering that different channels and features have varying signal-to-noise ratios (SNRs), we constructed a subject-electrode-specific null distribution to assess whether a feature block explains neural activity beyond chance and to enable cross-electrode analysis. This was done by shuffling the feature rows 500 times while keeping the word-onset covariates fixed. [...] Electrodes with z > 3.95 (one-tailed α = .05, Bonferroni-corrected across N = 1268 electrodes) were deemed responsive, corresponding to values exceeding 3.95 standard deviations above the shuffle mean.
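
The Pseudocode row above describes building each feature-specific embedding as the residual of a ridge regression between consecutive saturation layers. The following is a minimal sketch of that idea, not the authors' code: it uses a closed-form NumPy ridge solver in place of the Scikit-learn model named in the Software Dependencies row, and the function names, the `alpha` value, the toy layer indices, and the random hidden states are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form multi-output ridge with an intercept (data are centered)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    return lambda X_new: (X_new - x_mean) @ W + y_mean


def residual_embeddings(hidden_states, saturation_layers, alpha=1.0):
    """Sketch of the residual construction described in the Pseudocode row.

    hidden_states: dict mapping layer index -> array of shape (n_tokens, d).
    saturation_layers: dict with keys "syntax", "meaning", "reasoning" giving
        the layer indices found by probing (hypothetical values below).
    alpha: ridge penalty; the value used in the paper is not given here.
    """
    low = 0  # the lexical embedding is the layer-0 hidden state
    embeddings = {"lexicon": hidden_states[low]}
    for name in ("syntax", "meaning", "reasoning"):
        high = saturation_layers[name]
        g = fit_ridge(hidden_states[low], hidden_states[high], alpha)
        # Residual: what the higher layer holds beyond a linear map of the lower one.
        embeddings[name] = hidden_states[high] - g(hidden_states[low])
        low = high
    return embeddings


# Toy data standing in for real LLM hidden states (assumption).
rng = np.random.default_rng(0)
toy_states = {layer: rng.standard_normal((200, 16)) for layer in (0, 4, 9, 15)}
toy_layers = {"syntax": 4, "meaning": 9, "reasoning": 15}
toy_embeddings = residual_embeddings(toy_states, toy_layers)
```

Each residual has the same shape as the hidden states it was derived from, so the four embeddings can be fed to the same downstream encoding model.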
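
The responsiveness criterion quoted in the Experiment Setup row can be checked with a short calculation: a one-tailed α of .05 divided across 1268 electrodes corresponds to a standard-normal quantile of about 3.95. The sketch below reproduces that arithmetic with Python's standard library and then z-scores an observed correlation against a shuffle null. The helper name, the use of the sample standard deviation, and the toy null values are assumptions, not details from the paper.

```python
from statistics import NormalDist, mean, stdev

n_electrodes = 1268  # from the Experiment Setup row
alpha = 0.05         # one-tailed significance level before correction

# Bonferroni correction: divide alpha by the number of tests.
alpha_corrected = alpha / n_electrodes

# One-tailed z threshold under a standard normal distribution.
z_threshold = NormalDist().inv_cdf(1.0 - alpha_corrected)


def z_against_null(observed, null_values):
    """z-score of an observed statistic relative to a shuffle null.

    Uses the sample standard deviation (an assumption; the source only says
    "standard deviations above the shuffle mean").
    """
    return (observed - mean(null_values)) / stdev(null_values)


# Toy null of shuffled-feature correlations (illustrative values only).
toy_null = [-0.02, -0.01, 0.0, 0.01, 0.02]
strong = z_against_null(0.10, toy_null)
weak = z_against_null(0.03, toy_null)

print(f"z threshold: {z_threshold:.2f}")
print("strong electrode responsive:", strong > z_threshold)
print("weak electrode responsive:", weak > z_threshold)
```

In the paper's setup the null would hold 500 shuffled values per electrode rather than the five toy values used here.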