Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Authors: Brandon Huang, Chancharik Mitra, Leonid Karlinsky, Assaf Arbelle, Trevor Darrell, Roei Herzig

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
Researcher Affiliation | Collaboration | 1 University of California, Berkeley; 2 IBM Research; 3 MIT-IBM Watson AI Lab
Pseudocode | Yes | Algorithm 1 MTV-EXTRACT for finding task vector locations
Open Source Code | Yes | Code: https://github.com/Brandon3964/MultiModal-Task-Vector
Open Datasets | Yes | We use the following commonly-evaluated datasets which emphasize different aspects of multimodal reasoning, including visual features (VizWiz) and outside knowledge (OK-VQA): (1) VizWiz [23]... (2) OK-VQA dataset [54]... (1) The Flowers dataset [59]... (2) Caltech's CUB Dataset on Birds [82]
Dataset Splits | Yes | The task vector is extracted using examples from the train set of the dataset and evaluated on the validation set.
Hardware Specification | Yes | all our experiments can be run on a single NVIDIA A6000 GPU.
Software Dependencies | No | The paper mentions 'PyTorch [61]' but does not provide a specific version number for PyTorch or any other software libraries used in the implementation.
Experiment Setup | Yes | For VQA, we show the results of MTV with 4 shots per 100 iterations to calculate the mean activations and 100 examples for task vector locations (500 examples total). For object classification, we extract MTV based on a 2-way, one-shot regimen per 100 iterations for both mean activations and task vector locations (200 examples total).
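
To make the Pseudocode and Experiment Setup rows above concrete, the sketch below illustrates the general task-vector recipe they describe: average attention-head activations over many sampled few-shot prompts, pick a small set of (layer, head) locations, and patch the means back in at zero-shot inference. This is a minimal sketch, not the authors' implementation: a toy self-attention stack and random tensors stand in for the multimodal LLM and the encoded image-text shots, and a simple largest-norm heuristic is substituted for the paper's MTV-EXTRACT location search. Names such as ToyMMLLM, NUM_ITERS, and TOP_K are illustrative assumptions.

# Minimal sketch (assumptions noted above); PyTorch is used only as a generic tensor library here.
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_LAYERS = 64, 4, 3
HEAD_DIM = D_MODEL // N_HEADS
NUM_ITERS, SEQ_LEN, TOP_K = 100, 16, 4   # ~100 iterations of sampled shots, as in the setup row


class ToyMMLLM(nn.Module):
    """Toy self-attention stack standing in for the multimodal LLM."""

    def __init__(self):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True) for _ in range(N_LAYERS))
        self.ffns = nn.ModuleList(nn.Linear(D_MODEL, D_MODEL) for _ in range(N_LAYERS))

    def forward(self, x, patch=None):
        """patch maps (layer, head) -> mean activation (HEAD_DIM,) written into the
        last token's head output, i.e. the task-vector intervention."""
        head_acts = []  # per-layer (batch, heads, head_dim) activations of the last token
        for layer, (attn, ffn) in enumerate(zip(self.attns, self.ffns)):
            a, _ = attn(x, x, x, need_weights=False)
            if patch is not None:
                a = a.clone()
                heads = a[:, -1, :].reshape(-1, N_HEADS, HEAD_DIM)
                for (pl, ph), vec in patch.items():
                    if pl == layer:
                        heads[:, ph, :] = vec
                a[:, -1, :] = heads.reshape(-1, D_MODEL)
            head_acts.append(a[:, -1, :].reshape(-1, N_HEADS, HEAD_DIM).detach())
            x = x + a
            x = x + torch.relu(ffn(x))
        return x, head_acts


model = ToyMMLLM().eval()

# Step 1: average each head's last-token activation over many sampled few-shot prompts.
sums = torch.zeros(N_LAYERS, N_HEADS, HEAD_DIM)
with torch.no_grad():
    for _ in range(NUM_ITERS):
        shots = torch.randn(1, SEQ_LEN, D_MODEL)   # stand-in for one encoded few-shot prompt
        _, head_acts = model(shots)
        sums += torch.stack([h.mean(0) for h in head_acts])
mean_acts = sums / NUM_ITERS

# Step 2: choose head locations to patch. The paper's MTV-EXTRACT searches for these;
# a largest-mean-activation-norm heuristic is used here purely as a placeholder.
flat = mean_acts.norm(dim=-1).flatten().topk(TOP_K).indices
locations = [(int(i) // N_HEADS, int(i) % N_HEADS) for i in flat]

# Step 3: run a zero-shot query with the mean activations patched into those locations,
# so no in-context examples occupy the context window at inference.
patch = {loc: mean_acts[loc] for loc in locations}
query = torch.randn(1, 8, D_MODEL)                 # stand-in for a single encoded test query
with torch.no_grad():
    out, _ = model(query, patch=patch)
print("patched locations:", locations, "| output shape:", tuple(out.shape))

In the paper itself, the locations come from Algorithm 1 (MTV-EXTRACT) rather than a norm heuristic, and the extraction budget matches the Experiment Setup row (roughly 500 train examples for VQA, 200 for object classification).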