Matryoshka Query Transformer for Large Vision-Language Models

Authors: Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. (A minimal sketch of this prefix-truncation idea follows the table.)
Researcher Affiliation | Academia | Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang; University of California, Los Angeles; {whu,zdou,liunian.harold.li,kamatha}@cs.ucla.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/MQT-LLaVA
Open Datasets | Yes | We evaluate our model across 11 mainstream benchmarks, including VizWiz (Gurari et al., 2018), ScienceQA-IMG (Lu et al., 2022), VQA-v2 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), POPE (Li et al., 2023c), MME Perception (Fu et al., 2023), MME Cognition (Fu et al., 2023), MMBench (Liu et al., 2023c), LLaVA-Bench (In-the-Wild) (Liu et al., 2023b), and MM-Vet (Yu et al., 2024).
Dataset Splits | No | We train only the query transformer in the first-stage alignment, using LLaVA-558K for 1 epoch with a batch size of 256 and a learning rate of 1e-3. We then fine-tune both the query transformer and LLM using LLaVA-665K for 2 epochs with a batch size of 128 and a learning rate of 2e-5.
Hardware Specification | Yes | All training is on 8x A6000s, with 4 and 30 hours per stage, respectively.
Software Dependencies | No | The paper mentions models like LLaVA-1.5, CLIP ViT-L/14, and Vicuna-v1.5, but does not provide specific software versions for libraries like PyTorch or CUDA.
Experiment Setup | Yes | We train only the query transformer in the first-stage alignment, using LLaVA-558K for 1 epoch with a batch size of 256 and a learning rate of 1e-3. We then fine-tune both the query transformer and LLM using LLaVA-665K for 2 epochs with a batch size of 128 and a learning rate of 2e-5. (A configuration sketch of this two-stage recipe follows the table.)
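
The flexible token budget noted in the Research Type row can be pictured with a short sketch. The PyTorch-style snippet below is a minimal illustration, not the authors' implementation: it assumes a pool of learnable latent queries (capped at 256, matching the maximum reported above) of which only the first m are used, with m sampled per training step and fixed to the desired budget at inference. The class name, layer choices, and hyperparameters (MatryoshkaQueryTransformer, dim, num_heads, num_layers, the sampling scheme) are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class MatryoshkaQueryTransformer(nn.Module):
    """Illustrative query transformer with a truncatable pool of latent queries."""

    def __init__(self, max_queries=256, dim=1024, num_heads=16, num_layers=2):
        super().__init__()
        # Learnable latent query tokens; only a prefix of them is ever used.
        self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats, num_tokens):
        # image_feats: (B, 576, dim) grid features from the vision encoder.
        # Keep only the first `num_tokens` queries -- the "Matryoshka" prefix.
        q = self.queries[:num_tokens].unsqueeze(0).expand(image_feats.size(0), -1, -1)
        return self.decoder(q, image_feats)  # (B, num_tokens, dim) visual tokens

# One model serves every budget: sample the prefix length during training,
# then pick any fixed budget (e.g. 256 instead of the usual 576) at inference.
mqt = MatryoshkaQueryTransformer()
feats = torch.randn(2, 576, 1024)
m = random.randint(1, 256)  # training-time sampling scheme is our assumption
visual_tokens = mqt(feats, m)
print(visual_tokens.shape)  # torch.Size([2, m, 1024])
```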
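
The two-stage schedule quoted in the Dataset Splits and Experiment Setup rows can likewise be summarized as a configuration sketch. The dictionary below only transcribes the hyperparameters and wall-clock figures reported above; the key names and structure are our own assumptions, not the authors' training scripts.

```python
# Hypothetical configuration mirroring the reported two-stage recipe
# (LLaVA-558K alignment, then LLaVA-665K fine-tuning); key names are assumed.
TRAINING_STAGES = {
    "stage1_alignment": {
        "trainable_modules": ["query_transformer"],   # LLM kept frozen
        "dataset": "LLaVA-558K",
        "epochs": 1,
        "global_batch_size": 256,
        "learning_rate": 1e-3,
        "reported_wall_clock_hours": 4,               # on 8x A6000 GPUs
    },
    "stage2_finetune": {
        "trainable_modules": ["query_transformer", "llm"],
        "dataset": "LLaVA-665K",
        "epochs": 2,
        "global_batch_size": 128,
        "learning_rate": 2e-5,
        "reported_wall_clock_hours": 30,              # on 8x A6000 GPUs
    },
}
```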