Matryoshka Query Transformer for Large Vision-Language Models

Authors: Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. (A minimal sketch of this prefix-truncation idea follows the table.)
Researcher Affiliation | Academia | Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang; University of California, Los Angeles; {whu,zdou,liunian.harold.li,kamatha}@cs.ucla.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/MQT-LLaVA
Open Datasets | Yes | We evaluate our model across 11 mainstream benchmarks, including VizWiz (Gurari et al., 2018), ScienceQA-IMG (Lu et al., 2022), VQA-v2 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), POPE (Li et al., 2023c), MME Perception (Fu et al., 2023), MME Cognition (Fu et al., 2023), MMBench (Liu et al., 2023c), LLaVA-Bench (In-the-Wild) (Liu et al., 2023b), and MM-Vet (Yu et al., 2024).
Dataset Splits | No | We train only the query transformer in the first-stage alignment, using LLaVA-558K for 1 epoch with a batch size of 256 and a learning rate of 1e-3. We then fine-tune both the query transformer and LLM using LLaVA-665K for 2 epochs with a batch size of 128 and a learning rate of 2e-5.
Hardware Specification | Yes | All training is on 8x A6000s, with 4 and 30 hours per stage, respectively.
Software Dependencies | No | The paper mentions models like LLaVA-1.5, CLIP ViT-L/14, and Vicuna-v1.5, but does not provide specific software versions for libraries like PyTorch or CUDA.
Experiment Setup | Yes | We train only the query transformer in the first-stage alignment, using LLaVA-558K for 1 epoch with a batch size of 256 and a learning rate of 1e-3. We then fine-tune both the query transformer and LLM using LLaVA-665K for 2 epochs with a batch size of 128 and a learning rate of 2e-5. (A configuration sketch of this two-stage recipe follows the table.)
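
The flexible token budget noted in the Research Type row can be pictured with a short sketch. The PyTorch-style snippet below is a minimal illustration, not the authors' implementation: it assumes a pool of learnable latent queries (capped at 256, matching the maximum reported above) of which only the first m are used, with m sampled per training step and fixed to the desired budget at inference. The class name, layer choices, and hyperparameters (MatryoshkaQueryTransformer, dim, num_heads, num_layers, the sampling scheme) are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class MatryoshkaQueryTransformer(nn.Module):
    """Illustrative query transformer with a truncatable pool of latent queries."""

    def __init__(self, max_queries=256, dim=1024, num_heads=16, num_layers=2):
        super().__init__()
        # Learnable latent query tokens; only a prefix of them is ever used.
        self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_feats, num_tokens):
        # image_feats: (B, 576, dim) grid features from the vision encoder.
        # Keep only the first `num_tokens` queries -- the "Matryoshka" prefix.
        q = self.queries[:num_tokens].unsqueeze(0).expand(image_feats.size(0), -1, -1)
        return self.decoder(q, image_feats)  # (B, num_tokens, dim) visual tokens

# One model serves every budget: sample the prefix length during training,
# then pick any fixed budget (e.g. 256 instead of the usual 576) at inference.
mqt = MatryoshkaQueryTransformer()
feats = torch.randn(2, 576, 1024)
m = random.randint(1, 256)  # training-time sampling scheme is our assumption
visual_tokens = mqt(feats, m)
print(visual_tokens.shape)  # torch.Size([2, m, 1024])
```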
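
The two-stage schedule quoted in the Dataset Splits and Experiment Setup rows can likewise be summarized as a configuration sketch. The dictionary below only transcribes the hyperparameters and wall-clock figures reported above; the key names and structure are our own assumptions, not the authors' training scripts.

```python
# Hypothetical configuration mirroring the reported two-stage recipe
# (LLaVA-558K alignment, then LLaVA-665K fine-tuning); key names are assumed.
TRAINING_STAGES = {
    "stage1_alignment": {
        "trainable_modules": ["query_transformer"],   # LLM kept frozen
        "dataset": "LLaVA-558K",
        "epochs": 1,
        "global_batch_size": 256,
        "learning_rate": 1e-3,
        "reported_wall_clock_hours": 4,               # on 8x A6000 GPUs
    },
    "stage2_finetune": {
        "trainable_modules": ["query_transformer", "llm"],
        "dataset": "LLaVA-665K",
        "epochs": 2,
        "global_batch_size": 128,
        "learning_rate": 2e-5,
        "reported_wall_clock_hours": 30,              # on 8x A6000 GPUs
    },
}
```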