Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mantis: Interleaved Multi-Image Instruction Tuning

Authors: Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2.
Researcher Affiliation | Collaboration | University of Waterloo, Tsinghua University, Sea AI Lab
Pseudocode | No | The paper describes the model architecture and data curation processes but does not present any structured pseudocode or algorithm blocks. Table 2 shows a 'Prompt template used to curate the Multi-VQA subset', which is not pseudocode.
Open Source Code | Yes | We will release all the code, data, and models to help with the reproducibility of our results.
Open Datasets | Yes | Dataset: We create the first multi-image instruction-tuning dataset Mantis-Instruct. It has a total of 721K instances, consisting of 14 subsets to cover all the multi-image skills. Among the 14 subsets, 10 subsets are from the existing datasets. For example, NLVR2 (Suhr et al., 2018), IconQA (Jhamtani & Berg-Kirkpatrick, 2018), etc. are used to cover reasoning skill; DreamSim (Fu et al., 2023), Birds-to-Words (Forbes et al., 2019), etc. are used to cover comparison skill; NExT-QA (Xiao et al., 2021), STAR (Wu & Yu, 2021), etc. are used to cover temporal understanding skill.
Dataset Splits | Yes | NLVR2 (Suhr et al., 2018) ... We use the test-public split for evaluation. Qbench (Wu et al., 2023a) ... We evaluate the Qbench2-A2-pair dev set... BLINK (Fu et al., 2024) ... We report results on the validation set of the benchmark. MVBench (Li et al., 2023d) ... We report results on the test split. Mantis-Eval ... We report results on the test split.
Hardware Specification | Yes | All full fine-tuning ran on 16 A100 GPUs while the ablation study ran on 8 A100 GPUs.
Software Dependencies | Yes | We speed up our training and inference with FlashAttention-2 (Dao, 2023). We use DeepSpeed ZeRO-3 (Aminabadi et al., 2022) for full fine-tuning. We apply QLoRA (Dettmers et al., 2023) along with DoRA (Liu et al., 2024) to more efficiently do comprehensive ablation studies under limited resources.
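The paper names QLoRA combined with DoRA for its ablation runs. A minimal sketch of what such a configuration might look like with the Hugging Face `transformers` and `peft` libraries, which expose 4-bit NF4 quantization and a `use_dora` flag; the rank, alpha, and target modules below are illustrative assumptions, not values from the paper:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA side: load the base model in 4-bit NF4 with double quantization,
# computing in bfloat16 (standard QLoRA recipe from Dettmers et al., 2023).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# DoRA side: a LoRA adapter with weight decomposition enabled.
# r / alpha / target_modules are hypothetical placeholders.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,  # switches LoRA to DoRA (Liu et al., 2024)
    task_type="CAUSAL_LM",
)
```

The two configs would then be passed to `from_pretrained(..., quantization_config=bnb_config)` and `get_peft_model(model, peft_config)` respectively; this is a sketch of the general recipe, not the authors' exact setup.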
Experiment Setup | Yes | During the fine-tuning, we train each model on the data for 1 epoch, with a batch size of 128. The maximum context length is set to 8192. The learning rate is set to 1e-5 for all models, except for Idefics2, where the learning rate is set to 5e-6 to better preserve its original knowledge. We set the warmup ratio to 0.03 and use a cosine learning rate scheduler.
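The schedule described in the setup row (linear warmup over 3% of steps, then cosine decay from the peak learning rate) can be sketched in plain Python; the function name and the decay-to-zero floor are assumptions, since the excerpt does not state a minimum learning rate:

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to 0 (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup: 0 -> peak_lr over the first warmup_ratio of steps.
        return peak_lr * step / warmup_steps
    # Cosine decay: peak_lr -> 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps the warmup occupies the first 30 steps, the learning rate peaks at step 30, and decays smoothly toward zero by the final step; Idefics2 would use `peak_lr=5e-6` instead.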