Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Disentangling Superpositions: Interpretable Brain Encoding Model with Sparse Concept Atoms

Authors: Alicia Zeng, Jack Gallant

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	When applied to fMRI data collected during story listening, our model matches the prediction performance of conventional dense models while substantially enhancing interpretability. It enables novel neuroscientific analyses such as disentangling overlapping cortical representations of time, space, and number, and revealing structured similarity among distributed conceptual maps.
Researcher Affiliation	Academia	Alicia Zeng Biophysics Program University of California, Berkeley Berkeley, CA 94720 EMAIL Jack Gallant Department of Neuroscience University of California, Berkeley Berkeley, CA 94720 EMAIL
Pseudocode	No	The paper describes methods and processes in paragraph form and mathematical equations but does not present any structured pseudocode or algorithm blocks.
Open Source Code	Yes	All analyses were implemented in Python using a custom framework called sparseconcept, which is openly available on GitHub.⁴
Open Datasets	Yes	The raw fMRI dataset used in this study is publicly available via the Gallant Lab GIN repository.¹ Preprocessed fMRI data and aligned stimulus features are also available on OSF.²
Dataset Splits	Yes	The full dataset comprised eleven stories per participant. Ten stories were used for training the encoding models, yielding approximately 125 minutes of data per subject. One remaining story was held out for testing and presented twice to each participant. To improve the signal-to-noise ratio in the test data, BOLD responses across the two repetitions were averaged. Final model evaluations were conducted on this averaged 10-minute test set.
Hardware Specification	Yes	Training required approximately 30 to 60 minutes per subject on a single NVIDIA RTX A6000 GPU.
Software Dependencies	No	The analysis pipeline relies on standard scientific Python libraries including numpy, scipy, matplotlib, scikit-learn, statsmodels and pycortex, and uses the himalaya package with a PyTorch backend for efficient voxelwise model fitting. No specific version numbers are provided for these libraries.
Experiment Setup	Yes	For each voxel, two regularization hyperparameters (λ_semantic and λ_low-level) were optimized using leave-one-run-out cross-validation on the training set. The procedure was as follows: The 10 training runs were split into 9 training and 1 validation run, iterated over all folds. For each fold, 20 logarithmically spaced values (from 10^1 to 10^20) were tested for each hyperparameter. Prediction accuracy was computed on the held-out run. The hyperparameter pair yielding the highest average prediction accuracy across folds was selected separately for each voxel.
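
The hyperparameter search described in the Experiment Setup row can be sketched as follows. This is a minimal illustration in plain NumPy on synthetic data, not the authors' sparseconcept or himalaya code. The function names, array sizes, the closed-form two-band ridge solver, and the use of Pearson correlation as the prediction accuracy are all assumptions made for the example. Only the 10 runs, the leave-one-run-out scheme, the 20 log-spaced values per hyperparameter, and the per-voxel selection come from the quoted text.

```python
import numpy as np


def pearson_per_voxel(y_true, y_pred):
    """Pearson correlation between matching columns of two (time, voxels) arrays."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    safe = np.where(denom > 0, denom, 1.0)  # avoid division by zero
    return np.where(denom > 0, (yt * yp).sum(axis=0) / safe, 0.0)


def fit_two_band_ridge(X_sem, X_low, Y, lam_sem, lam_low):
    """Closed-form ridge with a separate penalty for each feature space (assumed solver)."""
    X = np.hstack([X_sem, X_low])
    penalty = np.concatenate([
        np.full(X_sem.shape[1], lam_sem),
        np.full(X_low.shape[1], lam_low),
    ])
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ Y)


def select_lambdas(runs_sem, runs_low, runs_Y, grid):
    """Leave-one-run-out search over all (lam_sem, lam_low) pairs, chosen per voxel."""
    n_runs, n_voxels = len(runs_Y), runs_Y[0].shape[1]
    scores = np.zeros((len(grid), len(grid), n_voxels))
    for held_out in range(n_runs):
        train = [r for r in range(n_runs) if r != held_out]
        X_sem = np.vstack([runs_sem[r] for r in train])
        X_low = np.vstack([runs_low[r] for r in train])
        Y = np.vstack([runs_Y[r] for r in train])
        X_val = np.hstack([runs_sem[held_out], runs_low[held_out]])
        for i, lam_sem in enumerate(grid):
            for j, lam_low in enumerate(grid):
                W = fit_two_band_ridge(X_sem, X_low, Y, lam_sem, lam_low)
                scores[i, j] += pearson_per_voxel(runs_Y[held_out], X_val @ W)
    scores /= n_runs  # average accuracy across folds
    flat = scores.reshape(-1, n_voxels)
    i_best, j_best = np.unravel_index(flat.argmax(axis=0), scores.shape[:2])
    return grid[i_best], grid[j_best], flat.max(axis=0)


# Synthetic stand-in for 10 training runs (all sizes are hypothetical).
rng = np.random.default_rng(0)
n_runs, n_time, n_sem, n_low, n_voxels = 10, 40, 5, 3, 4
W_true = rng.normal(size=(n_sem + n_low, n_voxels))
runs_sem = [rng.normal(size=(n_time, n_sem)) for _ in range(n_runs)]
runs_low = [rng.normal(size=(n_time, n_low)) for _ in range(n_runs)]
runs_Y = [
    np.hstack([s, l]) @ W_true + 0.5 * rng.normal(size=(n_time, n_voxels))
    for s, l in zip(runs_sem, runs_low)
]

grid = np.logspace(1, 20, 20)  # 20 values from 10^1 to 10^20
lam_sem, lam_low, best_score = select_lambdas(runs_sem, runs_low, runs_Y, grid)
```

Each voxel ends up with its own pair of penalties, so `lam_sem` and `lam_low` are arrays with one entry per voxel. The exhaustive 20 by 20 grid is used here only because it is the simplest reading of the quoted procedure; the search strategy the authors actually used is not stated in this entry.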