Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Modality-Specialized Synergizers for Interleaved Vision-Language Generalists
Authors: Zhiyang Xu, Minqian Liu, Ying Shen, Joy Rimchala, Jiaxin Zhang, Qifan Wang, Yu Cheng, Lifu Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that VLGs integrated with MOSS achieve state-of-the-art performance, significantly surpassing baseline VLGs in complex interleaved generation tasks. Furthermore, our method exhibits strong generalizability on different VLGs. To validate the effectiveness and generalizability of our method and dataset, we adopt our method on two different VLG backbones with discrete and continuous image token spaces, and conduct extensive experiments on multiple datasets. |
| Researcher Affiliation | Collaboration | 1Virginia Tech 2Intuit AI Research 3Meta AI 4The Chinese University of Hong Kong 5University of California, Davis |
| Pseudocode | No | The paper describes the methodology in prose and uses diagrams (e.g., Figure 2) to illustrate concepts, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publicly released the code, model checkpoints, and dataset at https://github.com/VT-NLP/MoSS. |
| Open Datasets | Yes | Additionally, to improve VLGs' instruction-following capabilities under diverse interleaved generation scenarios, we introduce LEAFINSTRUCT, the first open-sourced high-quality interleaved instruction tuning data with 184,982 instances spanning more than 10 domains. We evaluate the interleaved generation capability of our method on InterleavedBench (Liu et al., 2024b). We construct a diverse instruction-tuning data collection from large-scale web resources and academic datasets, including MMDialog (Feng et al., 2023), VIST (Huang et al., 2016), WikiWeb2M (Burns et al., 2023) and YouCook2 (Zhou et al., 2018). |
| Dataset Splits | Yes | InterleavedBench has two splits: a context-based split in which the input of each instance is equipped with interleaved text and images; and a context-free split with text-only inputs. The context-based split contains 465 instances and the text-only split contains 350 instances. We only use the context-based split as the testing set since we mainly focus on tasks with interleaved inputs and outputs. |
| Hardware Specification | Yes | All the variants of LoRA in Section 7, including our MOSS, are trained with LEAFINSTRUCT for one epoch on 8 A100 GPUs with learning rate 2e-5, batch size 1 per GPU, and a gradient accumulation step of 16. |
| Software Dependencies | No | The paper mentions several models and frameworks used as backbones or for filtering, such as "Emu2 model," "EVA-02-CLIP-E-plus," "LLaMA-33B," "SDXL," "Llama-8B-Instruct," and "Llama3." However, it does not specify explicit version numbers for these software components or any other key libraries (e.g., Python, PyTorch, CUDA versions) that are critical for reproducibility. |
| Experiment Setup | Yes | All the variants of LoRA in Section 7, including our MOSS, are trained with LEAFINSTRUCT for one epoch on 8 A100 GPUs with learning rate 2e-5, batch size 1 per GPU, and a gradient accumulation step of 16. All the LoRA have a rank of 256, dropout rate of 0.05, and the LoRA α in Section 4 is set to 2 128. The kernel size of MOSS is 2×2, the stride is set to 1. During training, all parameters of the Emu2 model are kept frozen and only the LoRA parameters are updated. |
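The reported training setup can be summarized as a configuration sketch. This is a minimal illustration assembled from the Experiment Setup row above, not a config file published by the authors; the dictionary keys and helper function are hypothetical names, and the effective global batch size is derived from the stated per-GPU batch size, GPU count, and gradient accumulation.

```python
# Training hyperparameters as reported in the paper's Experiment Setup row.
# Key names are illustrative; the paper does not release a config in this form.
config = {
    "epochs": 1,
    "num_gpus": 8,               # A100 GPUs
    "learning_rate": 2e-5,
    "per_gpu_batch_size": 1,
    "grad_accum_steps": 16,
    "lora_rank": 256,
    "lora_dropout": 0.05,
    "moss_kernel_size": (2, 2),
    "moss_stride": 1,
    "backbone_frozen": True,     # only LoRA parameters are updated
}

def effective_batch_size(cfg):
    """Global batch size = GPUs x per-GPU batch size x gradient accumulation."""
    return cfg["num_gpus"] * cfg["per_gpu_batch_size"] * cfg["grad_accum_steps"]

print(effective_batch_size(config))  # 8 * 1 * 16 = 128
```

The effective global batch size of 128 follows directly from the stated values, which is worth noting when reproducing the run on a different GPU count: scaling `grad_accum_steps` inversely with `num_gpus` keeps the optimization trajectory comparable.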