Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions
Authors: Vinamra Benara, Chandan Singh, John Morris, Richard Antonello, Ion Stoica, Alexander Huth, Jianfeng Gao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM. Training QA-Emb reduces to selecting a set of underlying questions rather than learning model weights. We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli. QA-Emb significantly outperforms an established interpretable baseline, and does so while requiring very few questions. |
| Researcher Affiliation | Collaboration | Vinamra Benara* (UC Berkeley), Chandan Singh* (Microsoft Research), John X. Morris (Cornell University), Richard J. Antonello (UT Austin), Ion Stoica (UC Berkeley), Alexander G. Huth (UT Austin), Jianfeng Gao (Microsoft Research) |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | All code for QA-Emb is made available on GitHub at github.com/csinva/interpretable-embeddings. |
| Open Datasets | Yes | We analyze data from two recent studies [78, 79] (released under the MIT license), which contain fMRI responses for 3 human subjects listening to 20+ hours of narrative stories from podcasts. ... We take a random subset of 4,000 queries from the MSMarco dataset ([91], Creative Commons License) and their corresponding ground-truth documents, resulting in 5,210 documents. |
| Dataset Splits | Yes | We select the best-performing hyperparameters via cross-validation on 5 time-stratified bootstrap samples of the training set. ... We finetune the model on answers from LLaMA-3 8B with a few-shot prompt for 80% of the 10-grams in the 82 fMRI training stories (123,203 examples), and use the remaining 20% as a validation set for early stopping (30,801 examples). |
| Hardware Specification | Yes | Experiments were run using 64 AMD MI210 GPUs, each with 64 gigabytes of memory, and reproducing all experiments in the paper requires approximately 4 days (initial explorations required roughly 5 times this amount of compute). |
| Software Dependencies | Yes | For answering questions, we average the answers from Mistral-7B [26] (mistralai/Mistral-7B-Instruct-v0.2) and LLaMA-3 8B [27] (meta-llama/Meta-Llama-3-8B-Instruct) with two prompts. ... For generating questions, we prompt GPT-4 [24] (gpt-4-0125-preview). ... Specifically, we finetune a RoBERTa model [87] (roberta-base)... We run Elastic net using the MultiTaskElasticNet class from scikit-learn [80]. |
| Experiment Setup | Yes | We select the best ridge parameters from 12 logarithmically spaced values between 10 and 10,000. To model temporal delays in the fMRI signal, we also select between adding 4, 8, or 12 time-lagged duplicates of the stimulus features. ... We perform feature selection by running multi-task Elastic net with 20 logarithmically spaced regularization parameters ranging from 10⁻³ to 1 and then fit a Ridge regression to the selected features. ... We finetune using AdamW [88] with a learning rate of 5 × 10⁻⁵. |
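The QA-Emb idea quoted above (one binary feature per yes/no question posed to an LLM) and the logarithmically spaced hyperparameter grids can be sketched as follows. This is a minimal illustration, not the authors' code: `answer_yes_no` is a hypothetical stand-in for querying an LLM (the paper averages answers from Mistral-7B and LLaMA-3 8B across two prompts), and the toy keyword heuristic inside it exists only so the sketch runs offline.

```python
import numpy as np

def answer_yes_no(question: str, text: str) -> float:
    """Hypothetical placeholder for an LLM yes/no answer (1.0 = yes).

    Toy heuristic: answer "yes" iff any long word from the question
    also appears in the text. A real QA-Emb would prompt an LLM here.
    """
    keywords = {w.strip("?").lower() for w in question.split() if len(w) > 4}
    return 1.0 if keywords & set(text.lower().split()) else 0.0

def qa_embed(text: str, questions: list[str]) -> np.ndarray:
    """QA-Emb: the embedding is one binary feature per yes/no question."""
    return np.array([answer_yes_no(q, text) for q in questions])

questions = [
    "Does the input mention a number?",
    "Does the input describe physical motion?",
]
emb = qa_embed("The train rushed past the station", questions)
print(emb.shape)  # one feature per question: (2,)

# Hyperparameter grids matching the quoted setup:
# 12 log-spaced ridge parameters between 10 and 10,000,
# and 20 log-spaced elastic-net regularization strengths from 1e-3 to 1.
ridge_alphas = np.logspace(1, 4, num=12)
enet_alphas = np.logspace(-3, 0, num=20)
```

"Training" QA-Emb then amounts to choosing which questions to keep (e.g. via the multi-task elastic-net feature selection quoted above) rather than fitting embedding weights.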