Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Brain-Informed Fine-Tuning for Improved Multilingual Understanding in Language Models

Authors: Anuja Negi, Subba Reddy Oota, Anwar Nunez-Elizalde, Manish Gupta, Fatma Deniz

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results show that bilingual brain-informed fine-tuned language models outperform their vanilla (pretrained) counterparts in both brain encoding performance and most downstream NLP tasks across multiple languages. These findings suggest that brain-informed fine-tuning improves multilingual understanding in language models, offering a bridge between cognitive neuroscience and NLP research. We make our code publicly available.
Researcher Affiliation	Collaboration	Anuja Negi Technical University of Berlin Berlin, Germany 10623 EMAIL Subba Reddy Oota Technical University of Berlin Berlin, Germany 10623 EMAIL Anwar O Nunez-Elizalde EMAIL Manish Gupta Microsoft Research Hyderabad, India 500032 EMAIL Fatma Deniz Technical University of Berlin Berlin, Germany 10623 EMAIL
Pseudocode	No	The paper describes the methodology in prose and refers to Figure 1 for a 'Brain-informed fine-tuning pipeline', which is a diagram, not pseudocode. No explicit 'Pseudocode' or 'Algorithm' sections or blocks are found.
Open Source Code	Yes	We make our code publicly available. https://github.com/denizenslab/brain-informed-fine-tuning
Open Datasets	Yes	In this study, we use brain recordings of bilingual participants reading the same naturalistic stories in English and Chinese, from (Chen et al., 2024b)... We compare models fine-tuned with fMRI data from three English-monolingual participants (all male; one from Deniz et al. (2019) and two from LeBel et al. (2023))... For English, we used 9 tasks from the GLUE benchmark (Wang et al., 2018), and for Chinese, we used 7 tasks from the CLUE benchmark (Xu et al., 2020)... we additionally evaluated multilingual models on 3 tasks from the XGLUE benchmark (Liang et al., 2020) and 3 tasks from the XTREME benchmark (Hu et al., 2020).
Dataset Splits	Yes	Of the 11 stories, seven were used for fine-tuning the language models (covering 2756 TRs), three were used for training voxelwise encoding models (1117 TRs), and one story was reserved for testing (291 TRs). The same set of stories were used in each language to ensure comparability across languages.
Hardware Specification	Yes	All brain-informed fine-tuning experiments were conducted on a machine equipped with an NVIDIA TITAN RTX GPU (24 GB RAM), and NVIDIA RTX A6000 GPU (40GB RAM). The downstream NLP tasks were evaluated on machines with the same GPU configuration.
Software Dependencies	No	The paper mentions 'Hugging Face (Wolf et al., 2020)' for pretrained models, 'AdamW optimizer (Loshchilov & Hutter, 2017)', and a 'ReduceLROnPlateau scheduler'. However, it does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or tools. For example, it doesn't state 'PyTorch 1.9' or 'transformers 4.2.0'.
Experiment Setup	Yes	Training protocol. We fine-tuned the models using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 1e-4 and weight decay of 1e-3 for 30 epochs with a batch size of 32. A ReduceLROnPlateau scheduler was used to adjust the learning rate based on validation loss. We used mixed-precision training for computational efficiency and applied early stopping with a patience of 5 epochs based on validation performance. ... For downstream NLP task fine-tuning, we used a batch size of 64, learning rate of 2e-5, weight decay of 0.01, per-device evaluation batch size of 128, and trained for 3 epochs.
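
The values quoted in the Dataset Splits and Experiment Setup rows can be collected into a small, dependency-free sketch. This is an illustration, not the authors' code: the names `BrainFineTuneConfig`, `SPLIT_TRS`, `split_fractions`, and `early_stopping_epoch` are hypothetical, and the early-stopping rule assumes that "validation performance" means a validation loss that must strictly improve, which the quoted text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrainFineTuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    weight_decay: float = 1e-3
    max_epochs: int = 30
    batch_size: int = 32
    early_stopping_patience: int = 5


# TR counts quoted in the Dataset Splits row.
SPLIT_TRS = {"fine_tuning": 2756, "encoding_train": 1117, "test": 291}


def split_fractions(split_trs):
    """Return each split's share of the total number of TRs."""
    total = sum(split_trs.values())
    return {name: count / total for name, count in split_trs.items()}


def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Assumption: training stops once `patience` consecutive epochs fail to
    strictly improve on the best validation loss seen so far. If that never
    happens, training runs through the last provided epoch.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


if __name__ == "__main__":
    cfg = BrainFineTuneConfig()
    for name, frac in split_fractions(SPLIT_TRS).items():
        print(f"{name}: {frac:.1%} of TRs")
    # Made-up validation losses, for illustration only.
    losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 0.5]
    print("stops at epoch", early_stopping_epoch(losses, cfg.early_stopping_patience))
```

Under these assumptions the three splits sum to 4164 TRs, so roughly two thirds of the data goes to fine-tuning. The learning-rate scheduler and mixed-precision training mentioned in the row are left out because they depend on a framework whose version the paper does not state.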