Matryoshka Query Transformer for Large Vision-Language Models
Authors: Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. |
| Researcher Affiliation | Academia | Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang. University of California, Los Angeles. {whu,zdou,liunian.harold.li,kamatha}@cs.ucla.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/MQT-LLaVA |
| Open Datasets | Yes | We evaluate our model across 11 mainstream benchmarks, including VizWiz (Gurari et al., 2018), ScienceQA-IMG (Lu et al., 2022), VQA-v2 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), POPE (Li et al., 2023c), MME Perception (Fu et al., 2023), MME Cognition (Fu et al., 2023), MMBench (Liu et al., 2023c), LLaVA-Bench (In-the-Wild) (Liu et al., 2023b), and MM-Vet (Yu et al., 2024). |
| Dataset Splits | No | We train only the query transformer in the first-stage alignment, using LLaVA-558K for 1 epoch with a batch size of 256 and a learning rate of 1e-3. We then fine-tune both the query transformer and LLM using LLaVA-665K for 2 epochs with a batch size of 128 and a learning rate of 2e-5. |
| Hardware Specification | Yes | All training is on 8x A6000s, with 4 and 30 hours per stage, respectively. |
| Software Dependencies | No | The paper mentions models like LLaVA-1.5, CLIP ViT-L/14, and Vicuna-v1.5, but does not provide specific software versions for libraries like PyTorch or CUDA. |
| Experiment Setup | Yes | We train only the query transformer in the first-stage alignment, using LLaVA-558K for 1 epoch with a batch size of 256 and a learning rate of 1e-3. We then fine-tune both the query transformer and LLM using LLaVA-665K for 2 epochs with a batch size of 128 and a learning rate of 2e-5. |
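
The excerpt under "Research Type" describes the core mechanism: a query transformer trained with a randomly drawn number of latent query tokens, so that any prefix of those tokens can be used at inference time. Below is a minimal PyTorch sketch of that idea; the class and parameter names (`MatryoshkaQueryTransformer`, `max_queries`, the layer sizes) are illustrative assumptions, not the authors' released implementation.

```python
import random
import torch
import torch.nn as nn

class MatryoshkaQueryTransformer(nn.Module):
    """Illustrative sketch: cross-attends a variable-length prefix of learnable
    query tokens to vision-encoder grid features (assumed shapes, not the paper's code)."""

    def __init__(self, dim=1024, max_queries=256, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable latent query tokens; only the first m of them are used per step.
        self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, vision_feats, num_tokens=None):
        # vision_feats: (batch, num_patches, dim) grid features from the vision encoder.
        if num_tokens is None:
            # Training: draw a random prefix length so a single model learns to work
            # with any visual-token budget at inference ("Matryoshka" nesting).
            num_tokens = random.randint(1, self.queries.shape[0])
        q = self.queries[:num_tokens].unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        # Cross-attention reads the visual features; the outputs are the compressed
        # visual tokens handed to the LLM.
        return self.decoder(q, vision_feats)

# At inference, num_tokens is passed explicitly, e.g. 2, 16, 64, or 256.
```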
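The two-stage schedule quoted under "Dataset Splits" and "Experiment Setup" can also be restated as a configuration sketch. The dictionary layout and key names are assumptions for illustration; only the datasets, epochs, batch sizes, learning rates, and trainable components come from the quoted text.

```python
# Hedged summary of the two training stages reported in the table above.
TRAINING_STAGES = {
    "stage1_alignment": {
        "trainable": ["query_transformer"],   # LLM and vision encoder kept frozen
        "dataset": "LLaVA-558K",
        "epochs": 1,
        "batch_size": 256,
        "learning_rate": 1e-3,
    },
    "stage2_finetune": {
        "trainable": ["query_transformer", "llm"],
        "dataset": "LLaVA-665K",
        "epochs": 2,
        "batch_size": 128,
        "learning_rate": 2e-5,
    },
}
```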