Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Authors: Brandon Huang, Chancharik Mitra, Leonid Karlinsky, Assaf Arbelle, Trevor Darrell, Roei Herzig

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
Researcher Affiliation | Collaboration | 1 University of California, Berkeley; 2 IBM Research; 3 MIT-IBM Watson AI Lab
Pseudocode | Yes | Algorithm 1 MTV-EXTRACT for finding task vector locations
Open Source Code | Yes | Code: https://github.com/Brandon3964/MultiModal-Task-Vector
Open Datasets | Yes | We use the following commonly-evaluated datasets which emphasize different aspects of multimodal reasoning, including visual features (VizWiz) and outside knowledge (OK-VQA): (1) VizWiz [23]... (2) OK-VQA dataset [54]... (1) The Flowers dataset [59]... (2) Caltech's CUB Dataset on Birds [82]
Dataset Splits | Yes | The task vector is extracted using examples from the train set of the dataset and evaluated on the validation set.
Hardware Specification | Yes | all our experiments can be run on a single NVIDIA A6000 GPU.
Software Dependencies | No | The paper mentions 'PyTorch [61]' but does not provide a specific version number for PyTorch or any other software libraries used in the implementation.
Experiment Setup | Yes | For VQA, we show the results of MTV with 4 shots per 100 iterations to calculate the mean activations and 100 examples for task vector locations (500 examples total). For object classification, we extract MTV based on a 2-way, one-shot regimen per 100 iterations for both mean activations and task vector locations (200 examples total).
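
To make the Pseudocode and Experiment Setup rows above concrete, the sketch below illustrates the general task-vector recipe they describe: average attention-head activations over many sampled few-shot prompts, pick a small set of (layer, head) locations, and patch the means back in at zero-shot inference. This is a minimal sketch, not the authors' implementation: a toy self-attention stack and random tensors stand in for the multimodal LLM and the encoded image-text shots, and a simple largest-norm heuristic is substituted for the paper's MTV-EXTRACT location search. Names such as ToyMMLLM, NUM_ITERS, and TOP_K are illustrative assumptions.

# Minimal sketch (assumptions noted above); PyTorch is used only as a generic tensor library here.
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_LAYERS = 64, 4, 3
HEAD_DIM = D_MODEL // N_HEADS
NUM_ITERS, SEQ_LEN, TOP_K = 100, 16, 4   # ~100 iterations of sampled shots, as in the setup row


class ToyMMLLM(nn.Module):
    """Toy self-attention stack standing in for the multimodal LLM."""

    def __init__(self):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True) for _ in range(N_LAYERS))
        self.ffns = nn.ModuleList(nn.Linear(D_MODEL, D_MODEL) for _ in range(N_LAYERS))

    def forward(self, x, patch=None):
        """patch maps (layer, head) -> mean activation (HEAD_DIM,) written into the
        last token's head output, i.e. the task-vector intervention."""
        head_acts = []  # per-layer (batch, heads, head_dim) activations of the last token
        for layer, (attn, ffn) in enumerate(zip(self.attns, self.ffns)):
            a, _ = attn(x, x, x, need_weights=False)
            if patch is not None:
                a = a.clone()
                heads = a[:, -1, :].reshape(-1, N_HEADS, HEAD_DIM)
                for (pl, ph), vec in patch.items():
                    if pl == layer:
                        heads[:, ph, :] = vec
                a[:, -1, :] = heads.reshape(-1, D_MODEL)
            head_acts.append(a[:, -1, :].reshape(-1, N_HEADS, HEAD_DIM).detach())
            x = x + a
            x = x + torch.relu(ffn(x))
        return x, head_acts


model = ToyMMLLM().eval()

# Step 1: average each head's last-token activation over many sampled few-shot prompts.
sums = torch.zeros(N_LAYERS, N_HEADS, HEAD_DIM)
with torch.no_grad():
    for _ in range(NUM_ITERS):
        shots = torch.randn(1, SEQ_LEN, D_MODEL)   # stand-in for one encoded few-shot prompt
        _, head_acts = model(shots)
        sums += torch.stack([h.mean(0) for h in head_acts])
mean_acts = sums / NUM_ITERS

# Step 2: choose head locations to patch. The paper's MTV-EXTRACT searches for these;
# a largest-mean-activation-norm heuristic is used here purely as a placeholder.
flat = mean_acts.norm(dim=-1).flatten().topk(TOP_K).indices
locations = [(int(i) // N_HEADS, int(i) % N_HEADS) for i in flat]

# Step 3: run a zero-shot query with the mean activations patched into those locations,
# so no in-context examples occupy the context window at inference.
patch = {loc: mean_acts[loc] for loc in locations}
query = torch.randn(1, 8, D_MODEL)                 # stand-in for a single encoded test query
with torch.no_grad():
    out, _ = model(query, patch=patch)
print("patched locations:", locations, "| output shape:", tuple(out.shape))

In the paper itself, the locations come from Algorithm 1 (MTV-EXTRACT) rather than a norm heuristic, and the extraction budget matches the Experiment Setup row (roughly 500 train examples for VQA, 200 for object classification).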