Scaling laws for language encoding models in fMRI

Authors: Richard Antonello, Aditya Vaidya, Alexander Huth

Venue: NeurIPS 2023

Reproducibility audit: each entry below gives the variable assessed, the result, and the supporting LLM response quoting the paper.
Research Type: Experimental
Here we test whether larger open-source models such as those from the OPT and LLaMA families are better at predicting brain responses recorded using fMRI. Mirroring scaling results from other contexts, we found that brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with 15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects.

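The logarithmic scaling claim corresponds to a fit of the form corr ≈ a · log10(N) + b over parameter count N. A minimal sketch of such a fit, using hypothetical correlation values (the real per-model numbers are in the paper's figures):

```python
import numpy as np

# Hypothetical mean test-set correlations for models of increasing size.
# These values are illustrative placeholders, not the paper's numbers.
n_params = np.array([125e6, 1.3e9, 6.7e9, 13e9, 30e9])
mean_corr = np.array([0.080, 0.085, 0.089, 0.091, 0.092])

# Fit the logarithmic form implied by the abstract: corr ~ a*log10(N) + b.
# np.polyfit with deg=1 returns (slope, intercept).
a, b = np.polyfit(np.log10(n_params), mean_corr, deg=1)
print(f"corr ~ {a:.4f} * log10(N) + {b:.4f}")
```
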
Researcher Affiliation: Academia
Richard J. Antonello, Department of Computer Science, The University of Texas at Austin (rjantonello@utexas.edu); Aditya R. Vaidya, Department of Computer Science, The University of Texas at Austin (avaidya@utexas.edu); Alexander G. Huth, Departments of Computer Science and Neuroscience, The University of Texas at Austin (huth@cs.utexas.edu)

Pseudocode: No
The paper describes methodological steps and formulas but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.

Open Source Code: Yes
We have released code as well as selected precomputed features, model weights, and model predictions generated for this paper. These data are available at https://github.com/HuthLab/encoding-model-scaling-laws.

Open Datasets: Yes
We used publicly available functional magnetic resonance imaging (fMRI) data collected from 3 human subjects as they listened to 20 hours of English language podcast stories over Sensimetrics S14 headphones [43, 44].

Dataset Splits: Yes
For every even-numbered non-embedding layer l in the Whisper model, as well as the 18th layer of the 33 billion parameter LLaMA model, we held out 20% of the training data and built an encoding model using the remaining 80% of the training data. This was repeated for each of 5 folds.

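In code, this split procedure is ordinary 5-fold cross-validation over the training set, run once per candidate layer. A minimal sketch, assuming a precomputed feature matrix X and response matrix Y, and using scikit-learn's Ridge in place of the authors' own ridge implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def layer_cv_score(X, Y, alpha=1.0, n_folds=5):
    """Mean held-out correlation for one layer's (delayed) features.

    X: (n_timepoints, n_features) stimulus features for one layer.
    Y: (n_timepoints, n_voxels) BOLD responses.
    """
    fold_scores = []
    # Default shuffle=False keeps folds contiguous in time, the
    # conservative choice for autocorrelated fMRI data.
    for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
        model = Ridge(alpha=alpha).fit(X[train_idx], Y[train_idx])
        pred = model.predict(X[test_idx])
        # Per-voxel Pearson correlation via z-scored products.
        pz = (pred - pred.mean(0)) / pred.std(0)
        tz = (Y[test_idx] - Y[test_idx].mean(0)) / Y[test_idx].std(0)
        fold_scores.append((pz * tz).mean(axis=0).mean())
    return float(np.mean(fold_scores))
```
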
Hardware Specification: Yes
Ridge regression was performed using compute nodes with 128 cores (2 AMD EPYC 7763 64-core processors) and 256GB of RAM. ... Feature extraction from language and speech models was performed on specialized GPU nodes that were the same as the previously-described compute nodes but with 3 NVIDIA A100 40GB cards.

Software Dependencies: No
The paper mentions a 'quadratic program solver [47]' but does not provide specific software dependencies or library versions (e.g., Python, PyTorch, TensorFlow versions) used for the experiments.

Experiment Setup: Yes
First, activations for each word in the stimulus text were extracted from each layer of each LM. ... We use time delays of 2, 4, 6, and 8 seconds of the representation to generate this temporal transformation. ... For a given story, contexts were grown until they reached 512 tokens, then reset to a new context of 256 tokens.

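The time-delay step is the standard finite impulse response construction for encoding models: lagged copies of the feature matrix are concatenated so the linear regression can absorb hemodynamic lag. A minimal sketch, assuming features already downsampled to the fMRI sampling rate (the 2 s TR and the helper name make_delayed are assumptions; the 2/4/6/8 s delays are from the quote):

```python
import numpy as np

def make_delayed(features, delays_s=(2, 4, 6, 8), tr_s=2.0):
    """Concatenate time-lagged copies of `features`.

    features: (n_trs, n_dims) stimulus features at the fMRI sampling rate.
    Returns an (n_trs, n_dims * len(delays_s)) matrix in which each block
    of columns is the feature matrix shifted later in time by one delay,
    zero-padded at the start.
    """
    n_trs, _ = features.shape
    blocks = []
    for d_s in delays_s:
        shift = int(round(d_s / tr_s))  # delay in TRs
        lagged = np.zeros_like(features)
        lagged[shift:] = features[:n_trs - shift]
        blocks.append(lagged)
    return np.concatenate(blocks, axis=1)
```

The context-resetting rule for activation extraction can be sketched the same way; that the reset keeps the most recent 256 tokens is our reading of the quote, not a stated detail:

```python
def iter_contexts(tokens, max_len=512, reset_len=256):
    """Yield the running context after each new token is appended.

    Contexts grow until they reach `max_len` tokens, then are reset
    to the most recent `reset_len` tokens (assumed trailing window).
    """
    context = []
    for tok in tokens:
        context.append(tok)
        if len(context) > max_len:
            context = context[-reset_len:]
        yield list(context)
```
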