Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions

Authors: Vinamra Benara, Chandan Singh, John Morris, Richard Antonello, Ion Stoica, Alexander Huth, Jianfeng Gao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM. Training QA-Emb reduces to selecting a set of underlying questions rather than learning model weights. We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli. QA-Emb significantly outperforms an established interpretable baseline, and does so while requiring very few questions. (A minimal sketch of this construction follows the table.)
Researcher Affiliation | Collaboration | Vinamra Benara* (UC Berkeley), Chandan Singh* (Microsoft Research), John X. Morris (Cornell University), Richard J. Antonello (UT Austin), Ion Stoica (UC Berkeley), Alexander G. Huth (UT Austin), Jianfeng Gao (Microsoft Research)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | All code for QA-Emb is made available on GitHub at github.com/csinva/interpretable-embeddings.
Open Datasets | Yes | We analyze data from two recent studies [78, 79] (released under the MIT license), which contain fMRI responses for 3 human subjects listening to 20+ hours of narrative stories from podcasts. ... We take a random subset of 4,000 queries from the MS MARCO dataset ([91], Creative Commons license) and their corresponding ground-truth documents, resulting in 5,210 documents.
Dataset Splits | Yes | We select the best-performing hyperparameters via cross-validation on 5 time-stratified bootstrap samples of the training set. ... We finetune the model on answers from LLaMA-3 8B with a few-shot prompt for 80% of the 10-grams in the 82 fMRI training stories (123,203 examples) and use the remaining 20% as a validation set for early stopping (30,801 examples). (See the finetuning sketch below the table.)
Hardware Specification | Yes | Experiments were run using 64 AMD MI210 GPUs, each with 64 gigabytes of memory, and reproducing all experiments in the paper requires approximately 4 days (initial explorations required roughly 5 times this amount of compute).
Software Dependencies | Yes | For answering questions, we average the answers from Mistral-7B [26] (mistralai/Mistral-7B-Instruct-v0.2) and LLaMA-3 8B [27] (meta-llama/Meta-Llama-3-8B-Instruct) with two prompts. ... For generating questions, we prompt GPT-4 [24] (gpt-4-0125-preview). ... Specifically, we finetune a RoBERTa model [87] (roberta-base). ... We run elastic net using the MultiTaskElasticNet class from scikit-learn [80]. (A usage sketch appears below the table.)
Experiment Setup | Yes | We select the best ridge parameters from 12 logarithmically spaced values between 10 and 10,000. To model temporal delays in the fMRI signal, we also select between adding 4, 8, or 12 time-lagged duplicates of the stimulus features. ... We perform feature selection by running multi-task elastic net with 20 logarithmically spaced regularization parameters ranging from 10⁻³ to 1 and then fit a ridge regression to the selected features. ... We finetune using AdamW [88] with a learning rate of 5 × 10⁻⁵. (Setup sketches follow the table.)
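
The QA-Emb construction described under Research Type is simple enough to sketch. Below is a minimal, hypothetical illustration: the `ask_llm` helper and the example questions are our placeholders, not the paper's learned question set (the paper averages yes/no answers from Mistral-7B and LLaMA-3 8B under two prompts).

```python
# Minimal sketch of the QA-Emb idea: each embedding dimension is an LLM's
# yes/no answer to one question about the input text.
import numpy as np

QUESTIONS = [
    "Does the input mention a number?",
    "Does the input describe a physical action?",
    "Does the input involve a conversation between people?",
]  # illustrative questions, not the set selected in the paper

def ask_llm(question: str, text: str) -> float:
    """Hypothetical wrapper returning 1.0 for 'yes', 0.0 for 'no'."""
    raise NotImplementedError("back this with Mistral-7B / LLaMA-3 8B prompts")

def qa_embed(text: str, questions=QUESTIONS) -> np.ndarray:
    # One binary feature per question; "training" QA-Emb means choosing
    # which questions to include, not fitting model weights.
    return np.array([ask_llm(q, text) for q in questions])
```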
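The Experiment Setup row's ridge grid and time-lag scheme can likewise be sketched. The hyperparameter grids below match the quoted values; the zero-padded shifting is our assumption about how the time-lagged duplicates are built.

```python
# Sketch, under stated assumptions, of the fMRI encoding setup: time-lagged
# copies of the stimulus features model the hemodynamic delay, and the ridge
# penalty is chosen from 12 log-spaced values between 10 and 10,000.
import numpy as np

def add_time_lags(X: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack n_lags shifted copies of X along the feature axis."""
    lagged = [np.roll(X, shift=k, axis=0) for k in range(1, n_lags + 1)]
    for k, L in enumerate(lagged, start=1):
        L[:k] = 0.0  # zero out rows that wrapped around
    return np.hstack([X] + lagged)

ridge_alphas = np.logspace(1, 4, 12)  # 12 log-spaced values in [10, 10000]
lag_options = (4, 8, 12)              # candidate numbers of delays
# Cross-validate over (alpha, n_lags) on time-stratified bootstrap samples.
```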
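For the feature-selection step, here is a sketch using scikit-learn's MultiTaskElasticNet, as named in the Software Dependencies row. The data arrays are placeholders and the stopping rule (a fixed question budget) is our assumption; the 20 log-spaced alphas in [10⁻³, 1] match the quoted setup.

```python
# Sketch of feature selection: multi-task elastic net over 20 log-spaced
# alphas selects a small set of questions, then ridge refits on them.
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet, Ridge

X = np.random.randn(500, 100)  # placeholder: time points x questions
Y = np.random.randn(500, 200)  # placeholder: time points x voxels

alphas = np.logspace(-3, 0, 20)  # 20 log-spaced values from 1e-3 to 1
for alpha in alphas:
    enet = MultiTaskElasticNet(alpha=alpha).fit(X, Y)
    selected = np.any(enet.coef_ != 0, axis=0)  # questions used by any voxel
    if selected.sum() <= 35:  # stop at a desired question budget (assumed)
        break

ridge = Ridge(alpha=1.0).fit(X[:, selected], Y)  # refit on selected features
```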
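Finally, the distillation setup from the Dataset Splits row, sketched with Hugging Face transformers: roberta-base finetuned with AdamW at 5 × 10⁻⁵ on an 80/20 split of the story 10-grams. The multi-label head size and the split mechanics here are illustrative assumptions.

```python
# Sketch, under stated assumptions, of finetuning roberta-base to mimic
# LLaMA-3 8B's yes/no answers: 80% of the 10-grams train the student,
# 20% are held out for early stopping.
import numpy as np
import torch
from transformers import RobertaForSequenceClassification

ngrams = ["..."]  # placeholder: the 10-grams from the 82 training stories
rng = np.random.default_rng(0)
perm = rng.permutation(len(ngrams))
cut = int(0.8 * len(ngrams))
train_idx, val_idx = perm[:cut], perm[cut:]  # 80/20 split for early stopping

model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=29,  # one logit per question; the count here is illustrative
    problem_type="multi_label_classification",
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # as quoted
```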