Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Authors: Brandon Huang, Chancharik Mitra, Leonid Karlinsky, Assaf Arbelle, Trevor Darrell, Roei Herzig
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference. |
| Researcher Affiliation | Collaboration | 1 University of California, Berkeley; 2 IBM Research; 3 MIT-IBM Watson AI Lab |
| Pseudocode | Yes | Algorithm 1 MTV-EXTRACT for finding task vector locations |
| Open Source Code | Yes | Code: https://github.com/Brandon3964/MultiModal-Task-Vector |
| Open Datasets | Yes | We use the following commonly-evaluated datasets which emphasize different aspects of multimodal reasoning, including visual features (VizWiz) and outside knowledge (OK-VQA): (1) VizWiz [23]... (2) OK-VQA dataset [54]... (1) The Flowers dataset [59]... (2) Caltech's CUB Dataset on Birds [82] |
| Dataset Splits | Yes | The task vector is extracted using examples from the train set of the dataset and evaluated on the validation set. |
| Hardware Specification | Yes | all our experiments can be run on a single NVIDIA A6000 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch [61]' but does not provide a specific version number for PyTorch or any other software libraries used in the implementation. |
| Experiment Setup | Yes | For VQA, we show the results of MTV with 4 shots per 100 iterations to calculate the mean activations and 100 examples for task vector locations (500 examples total). For object classification, we extract MTV based on a 2-way, one-shot regimen per 100 iterations for both mean activations and task vector locations (200 examples total). |
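
The "Experiment Setup" row above describes the core extraction step: many-shot examples are compressed by sampling small few-shot prompts (e.g., 4 shots) over 100 iterations and averaging the resulting model activations. The snippet below is a minimal sketch of that averaging loop only, not the authors' implementation; `run_model_and_get_head_activations`, the tensor shapes, and the toy data are all assumptions, and a real setup would hook the attention-head outputs of a multimodal LLM rather than return random tensors.

```python
# Minimal sketch (assumed, not the authors' code) of the mean-activation step
# quoted in the Experiment Setup row: sample a few-shot prompt per iteration,
# record per-head activations, and average them across iterations.
import random
import torch

NUM_ITERATIONS = 100   # "per 100 iterations" in the quoted setup
SHOTS_PER_PROMPT = 4   # "4 shots" for the VQA setting
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128  # assumed model dimensions


def run_model_and_get_head_activations(prompt_examples):
    """Hypothetical stand-in: run the multimodal LLM on a few-shot prompt and
    return attention-head activations at the last token, shaped
    (num_layers, num_heads, head_dim). Random tensors keep the sketch
    self-contained."""
    return torch.randn(NUM_LAYERS, NUM_HEADS, HEAD_DIM)


def extract_mean_activations(train_set, num_iterations=NUM_ITERATIONS,
                             shots=SHOTS_PER_PROMPT):
    """Average head activations over many sampled few-shot prompts."""
    running_sum = torch.zeros(NUM_LAYERS, NUM_HEADS, HEAD_DIM)
    for _ in range(num_iterations):
        prompt_examples = random.sample(train_set, shots)
        running_sum += run_model_and_get_head_activations(prompt_examples)
    return running_sum / num_iterations


if __name__ == "__main__":
    # 500 training examples total, as in the quoted VQA setup.
    toy_train_set = [f"example_{i}" for i in range(500)]
    mean_activations = extract_mean_activations(toy_train_set)
    print(mean_activations.shape)  # torch.Size([32, 32, 128])
```

In the paper, these mean activations are then written into the attention-head locations selected by Algorithm 1 (MTV-EXTRACT); the sketch stops at the averaging step because the location-selection procedure is only named, not detailed, in the table above.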