Training language models to summarize narratives improves brain alignment

Authors: Khai Loong Aw, Mariya Toneva

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the alignment of the base and booksum models with fMRI recordings of 8 participants reading a chapter of a popular book word-by-word, made publicly available by Wehbe et al. (2014a). Our main contributions are as follows: 1. In Section 4, we show that training language models for deeper narrative understanding improves alignment to human brain activity.
Researcher Affiliation | Academia | ¹Max Planck Institute for Software Systems, ²Singapore Management University
Pseudocode | Yes | Figure 3: Left. Interpretability approach to compare Pearson correlation brain alignment for fMRI samples corresponding to various discourse features. Right. Pearson correlation averages for three discourse features. Averages were computed over 8 layers for each model, sequence lengths 20 to 500, and all 8 participants. NLP models have greater brain alignment for Characters than for other discourse features. When trained to summarize narratives, the models improve their brain alignment significantly for all discourse features (paired t-test, FDR-corrected for multiple comparisons). However, alignment improves more for Characters than for other discourse features. Note that the average correlations shown here are low in magnitude because they include a large number of brain voxels that may not be significantly involved in brain-NLP alignment or language processing, as well as many layers and sequence lengths. Algorithm 1: Interpretability approach to compare brain alignment across discourse features.
Open Source Code | Yes | Code available at https://github.com/awwkl/brain_language_summarization.
Open Datasets | Yes | We use a publicly available brain dataset (Wehbe et al., 2014a) consisting of fMRI recordings of 8 participants reading chapter 9 of the book Harry Potter and the Sorcerer's Stone (Rowling et al., 1998). We specifically investigate 4 pretrained language models (i.e., base models) and 4 corresponding models obtained by training the base models on the BookSum dataset (Kryscinski et al., 2021) to improve the base language models' narrative understanding (i.e., booksum models).
Dataset Splits | Yes | Since the fMRI data was collected in 4 runs of approximately equal length, we use 4-fold cross-validation, where each fold corresponds to holding out one run of fMRI data for testing. First, we split our Harry Potter text dataset into a train and test set (75% and 25% of the text, respectively).
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or other compute infrastructure used for running the experiments.
Software Dependencies | No | The paper mentions using "Hugging Face" models but does not provide version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other libraries.
Experiment Setup | Yes | We select the ridge parameter via nested cross-validation. First, we reduce the dimensionality of the word-level NLP representations (a matrix in ℝ^(5176 × d)) using PCA and retain the top 10 principal components (more than 75% of the variance), resulting in a matrix in ℝ^(5176 × 10).
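The encoding-model setup quoted above (PCA to 10 components, ridge regression with the penalty chosen by nested cross-validation, 4-fold CV over the four fMRI runs, voxelwise Pearson correlation) can be sketched as follows. This is a minimal illustration on random toy data, not the paper's code: the array shapes, voxel count, and alpha grid are assumptions, and RidgeCV's inner CV stands in for the paper's nested parameter search.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_samples, n_feat, n_voxels = 5176, 768, 50  # 5176 words; toy feature/voxel counts
X = rng.standard_normal((n_samples, n_feat))    # word-level NLP representations (toy)
Y = rng.standard_normal((n_samples, n_voxels))  # fMRI responses (toy)

# Reduce the NLP representations to their top 10 principal components.
X10 = PCA(n_components=10).fit_transform(X)     # shape (5176, 10)

kf = KFold(n_splits=4)  # one fold per fMRI run (runs of ~equal length)
fold_r = []
for train_idx, test_idx in kf.split(X10):
    # RidgeCV picks the ridge penalty by inner cross-validation on the
    # training fold, approximating the nested-CV parameter selection.
    model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X10[train_idx], Y[train_idx])
    pred = model.predict(X10[test_idx])
    # Voxelwise Pearson correlation between predictions and held-out fMRI data.
    r = [np.corrcoef(pred[:, v], Y[test_idx, v])[0, 1] for v in range(n_voxels)]
    fold_r.append(float(np.mean(r)))

print(len(fold_r))  # one mean correlation per held-out run
```

With real data, the per-voxel correlations (rather than their mean) would be retained so alignment can be inspected across brain regions, layers, and sequence lengths.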
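The Figure 3 comparison (averaging alignment per discourse feature and testing base vs. booksum models with a paired t-test across the 8 participants) can be sketched as below. The feature names, score magnitudes, and random data are illustrative assumptions only; the paper additionally applies FDR correction across comparisons, which is omitted here for brevity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
features = ["Characters", "Emotion", "Motion"]  # illustrative discourse features
n_participants = 8

# Toy per-participant mean alignment scores; booksum shifted slightly upward
# to mimic the reported improvement after summarization training.
base = {f: rng.normal(0.02, 0.01, n_participants) for f in features}
booksum = {f: base[f] + rng.normal(0.01, 0.005, n_participants) for f in features}

for f in features:
    # Paired t-test across participants: booksum vs. base alignment.
    t, p = stats.ttest_rel(booksum[f], base[f])
    gain = float(np.mean(booksum[f]) - np.mean(base[f]))
    print(f, round(gain, 4), round(float(p), 4))
```

In the paper's analysis, each feature's score is the Pearson correlation averaged over the fMRI samples tagged with that feature, across 8 layers and sequence lengths 20 to 500.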