Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Brain-Inspired fMRI-to-Text Decoding via Incremental and Wrap-Up Language Modeling

Authors: Wentao Lu, Dong Nie, Pengcheng Xue, Zheng Cui, Piji Li, Daoqiang Zhang, Xuyun Wen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on the two datasets demonstrate that our method significantly outperforms state-of-the-art approaches, with performance gains increasing as decoding length grows. The code is available at https://github.com/WENXUYUN/CogReader. 4 Experiments 4.1 Experimental Setup
Researcher Affiliation	Collaboration	1College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China 2Chat Alpha AI, California, USA
Pseudocode	Yes	Algorithm 1 BLEU-N Score calculation process Algorithm 2 ROUGE-1 Score (Precision, Recall, F1) calculation process Algorithm 3 BERTScore (Precision, Recall, F1) Calculation
Open Source Code	Yes	The code is available at https://github.com/WENXUYUN/CogReader.
Open Datasets	Yes	This study employs three neuroimaging datasets: HCP S1200 [32], Narratives [24] and Huth dataset [18]. The Human Connectome Project's HCP S1200 dataset provides extensive fMRI data from 1,206 healthy young adults across seven cognitive domains. The Narratives dataset, a paired fMRI-text benchmark, contains fMRI recordings from 345 participants during naturalistic auditory comprehension of 27 real-world narrative stories. The Huth dataset comprises fMRI data from 8 subjects recorded while they passively listened to naturally spoken English stories.
Dataset Splits	Yes	In the fMRI Representation Learning Stage, during HCP pretraining phase, we split the HCP dataset into training and testing sets in a 4:1 ratio. While for the Narratives dataset, to avoid text leakage, we adopt a stimulus split approach to ensure that the train, validation and test sets use different story content, with a ratio of 60%, 20% and 20%, during Narratives pretraining stage and fMRI-to-text decoding stage.
Hardware Specification	Yes	All experiments are conducted on CUDA 12.2 and the computer with NVIDIA GeForce RTX 3090 GPU. The equipment used in the experiment is configured as follows: AMAX Tower Workstation TS40-X3, equipped with dual Intel Xeon 4316 CPUs (2.3 GHz, 20 cores), 256 GB of DDR4 memory (32 GB modules at 3200 MHz), a 480 GB SSD for the system disk, a 3.84 TB SSD for hot data, and a 16 TB 7200 RPM SATA enterprise HDD for data storage.
Software Dependencies	No	Our model is built using the PyTorch framework [28] and the Huggingface Transformers package [35]. All experiments are conducted on CUDA 12.2 and the computer with NVIDIA GeForce RTX 3090 GPU. The paper mentions PyTorch and Huggingface Transformers but does not provide specific version numbers for these key software components. Only CUDA has a version specified.
Experiment Setup	Yes	Our model is built using the PyTorch framework [28] and the Huggingface Transformers package [35]. All models utilize the Adam optimizer [16], with a warmup strategy. All experiments are conducted on CUDA 12.2 and the computer with NVIDIA GeForce RTX 3090 GPU. Additional implementation details can be found in the Appendix. 4.2 Parameter Settings This section discusses the optimal configuration of two key parameters for the fMRI-to-image decoding task. The evaluation is conducted on the Narrative dataset using BLEU-1, ROUGE-R, and BERTScore-R as performance metrics. Segment Length Ns: To determine the optimal segment length Ns, we vary it from 10 to 70 in steps of 10. For each value of Ns, we train and test the CogReader framework accordingly. As shown in Figure 4, all metrics exhibit a trend of first increasing and then decreasing with longer segment lengths, reaching peak performance at Ns = 20. Therefore, we set Ns = 20 for all subsequent experiments. Dimensionality of the MLP: For the MLP dimensionality, we evaluate five configurations: 0, 32, 64, 128, and 256. The corresponding performance of CogReader under each setting is shown in Figure 5. Taking into account both word-level and semantic-level evaluation metrics, we ultimately set the MLP dimension to 128, as this configuration demonstrates consistently good and stable performance across all metrics. In comparison, other configurations perform well on at most a single metric. Appendix A.2 Experimental Settings includes Tables 6, 7, and 8 with detailed parameter settings (e.g., mask ratio, epochs, batch size, optimizer, learning rate, embed dimensions, transformer depths/heads).
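
The Dataset Splits entry describes a stimulus-level split: whole stories are assigned to train, validation, or test, so no story text appears in more than one set. The following is a minimal sketch of that idea, not the authors' implementation. The function name `stimulus_split`, the `(story_id, sample)` record layout, the seed, and the choice to apply the 60/20/20 ratio to story counts (rather than sample counts) are all assumptions made for illustration.

```python
import random


def stimulus_split(records, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split (story_id, sample) pairs so each story lands in exactly one set.

    Hypothetical helper: the ratio is applied to the number of stories,
    which is an assumption; the paper only states 60%/20%/20%.
    """
    # Collect the unique stories and shuffle them reproducibly.
    stories = sorted({story for story, _ in records})
    rng = random.Random(seed)
    rng.shuffle(stories)

    # Cut the shuffled story list into three contiguous groups.
    n_train = round(ratios[0] * len(stories))
    n_val = round(ratios[1] * len(stories))
    groups = {
        "train": set(stories[:n_train]),
        "val": set(stories[n_train:n_train + n_val]),
        "test": set(stories[n_train + n_val:]),
    }

    # Route every sample to the set that owns its story.
    splits = {name: [] for name in groups}
    for story, sample in records:
        for name, ids in groups.items():
            if story in ids:
                splits[name].append(sample)
                break
    return splits, groups


# Toy data: 27 stories (matching the Narratives story count), 3 samples each.
records = [(f"story_{i:02d}", f"scan_{i:02d}_{j}") for i in range(27) for j in range(3)]
splits, groups = stimulus_split(records)
print({name: len(ids) for name, ids in groups.items()})
```

Splitting by story rather than by individual scan is what prevents text leakage: a random per-sample split would let the same narrative text appear in both training and test sets.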