Revealing Vision-Language Integration in the Brain with Multimodal Networks
Authors: Vighnesh Subramaniam, Colin Conwell, Christopher Wang, Gabriel Kreiman, Boris Katz, Ignacio Cases, Andrei Barbu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. |
| Researcher Affiliation | Academia | MIT CSAIL; CBMM; Department of Cognitive Science, Johns Hopkins University; Boston Children's Hospital, Harvard Medical School. |
| Pseudocode | No | No pseudocode or algorithm block found. The paper describes methods in text (e.g., Section C.3 "k-fold Ridge Regression") but not in a structured pseudocode format. |
| Open Source Code | Yes | Code, data, and annotations to reproduce our results are available at github.com/vsubramaniam851/brain-multimodal/ |
| Open Datasets | Yes | Our primary goal in this work is to use systematic comparisons between the neural predictivity of unimodal and multimodal models to probe for sites of vision-language integration in the brain. Our work makes the following contributions: ... to predict neural activity in a large-scale stereoelectroencephalography (SEEG) dataset consisting of neural responses (from intracranial electrodes) to the images (frames) and dialog of popular movies (Yaari et al., 2022). Neural Data: Invasive intracranial field potential recordings were collected during 7 sessions from 7 subjects (4 male, 3 female; aged 4–19, µ = 11.6, σ = 4.6) with pharmacologically intractable epilepsy. During each session, subjects watched a feature-length movie from the Aligned Multimodal Movie Treebank (AMMT) (Yaari et al., 2022) in a quiet room while neural activity was recorded using SEEG electrodes (Liu et al., 2009) at a rate of 2 kHz. |
| Dataset Splits | Yes | Per fold, we split our dataset of event structures contiguously based on occurrence in the movie. We place 80% of the event structures in the training set, 10% of the event structures in the validation set, and 10% in the testing set. (See the split sketch after the table.) |
| Hardware Specification | No | 28 modern GPUs on 7 machines were used for four weeks, evenly distributed across experiments. The paper does not specify the model numbers of the GPUs or any other specific hardware details like CPUs or memory. |
| Software Dependencies | No | We use the KFold function from Pedregosa et al. (2011) and implemented ridge regression in PyTorch (Paszke et al., 2019). The paper mentions PyTorch and Scikit-learn (implied by Pedregosa et al. (2011)) but does not provide specific version numbers for any software. |
| Experiment Setup | Yes | We then use these features from each layer as predictors in a 5-fold ridge regression predicting the averaged neural activity of a target neural site in response to each event structure (defined here as an image-text pair). We use the KFold function from Pedregosa et al. (2011) and implemented ridge regression in PyTorch (Paszke et al., 2019). In this analysis we run the 5-fold regression per λ value, where λ was varied using a logarithmic grid search over 10⁻¹ to 10⁶. (See the regression sketch after the table.) |
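
For concreteness, here is a minimal sketch of the contiguous 80/10/10 split described under Dataset Splits. The function and variable names are illustrative assumptions, not taken from the authors' released code, and the sketch shows a single split rather than the per-fold rotation of the held-out blocks.

```python
import numpy as np

def contiguous_split(n_events, train_frac=0.8, val_frac=0.1):
    """Split temporally ordered event structures into contiguous
    train / validation / test blocks (80/10/10 by default)."""
    idx = np.arange(n_events)                 # events in movie order
    n_train = int(train_frac * n_events)
    n_val = int(val_frac * n_events)
    train_idx = idx[:n_train]                 # first 80% of the movie
    val_idx = idx[n_train:n_train + n_val]    # next 10%
    test_idx = idx[n_train + n_val:]          # final 10%
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = contiguous_split(1000)
```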
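Similarly, the Experiment Setup row can be read as the following sketch: scikit-learn's KFold supplies the folds and a closed-form ridge solution is computed in PyTorch, with λ swept over a logarithmic grid from 10⁻¹ to 10⁶. The Pearson-correlation scoring and all variable names are assumptions made for illustration; the paper's exact regression code is in the linked repository.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: (X^T X + lam * I)^-1 X^T y."""
    A = X.T @ X + lam * torch.eye(X.shape[1])
    return torch.linalg.solve(A, X.T @ y)

def kfold_ridge_score(X, y, lam, n_splits=5):
    """Mean held-out Pearson correlation across contiguous folds
    (assumed scoring metric; shuffle=False preserves movie order)."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=False).split(np.arange(len(X))):
        tr, te = torch.as_tensor(train_idx), torch.as_tensor(test_idx)
        w = ridge_fit(X[tr], y[tr], lam)
        pred = X[te] @ w
        r = np.corrcoef(pred.flatten().numpy(), y[te].flatten().numpy())[0, 1]
        scores.append(r)
    return float(np.mean(scores))

# Layer features per event structure vs. averaged activity at one neural site,
# sweeping lambda over a logarithmic grid from 1e-1 to 1e6 (data here is random,
# purely for illustration).
X = torch.randn(500, 768)
y = torch.randn(500, 1)
lambdas = np.logspace(-1, 6, num=8)
best_lam = max(lambdas, key=lambda lam: kfold_ridge_score(X, y, lam))
```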