Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Authors: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted extensive evaluations of LaViDa across a wide range of vision-language tasks. Results show that LaViDa achieves competitive performance on most benchmarks, including MMMU [80], MathVista [51], ChartQA [53] and ScienceQA [52], when compared with AR VLMs like LLaVa1.6-7B [45, 43] and Open-LLaVa-Next-Llama3-8B [15].
Researcher Affiliation | Collaboration | (1) UCLA; (2) Panasonic AI Research; (3) Adobe Research; (4) Salesforce Research
Pseudocode | No | The paper describes training and inference algorithms in descriptive text within Sections 3.2 and 3.3, but does not present a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code and models are available at https://github.com/jacklishufan/LaViDa
Open Datasets | Yes | For the pretraining phase, we use LCS-558K [46], which consists of 558K image-text pairs. For the finetuning phase, we mostly use the dataset released by Open-LLaVa-Next [15]. We document the precise dataset composition of our stage-2 training in Table 5. Table 5: Composition of Stage-2 Training Data. We report the data sources and sample sizes used to compose the Stage-2 finetuning data. Data source (size): COCO [42] (349,860); GQA [30] (72,140); DocVQA [55] (10,211); ALLaVA-VFLAN [12] (202,966); Synthdog-En [33] (29,765); DVQA [31] (10,000); Visual Genome [35] (86,417); TextVQA [56] (21,953); SA-1B [34] (8,997); OCRVQA [56] (80,000); ChartQA [53] (18,317); LLaVA-150K [46] (2,988); GeoQA+ [13] (72,318); AI2D [32] (12,413); WikiArt [67] (500); Share-TextVQA [14] (500); Web-Celebrity [15] (500); Web-Landmark [15] (500).
Dataset Splits | Yes | Table 6: Evaluation Setup. We report evaluation split and generation length L used to produce results of Table 1 in the main paper. *We use a generation length of 100 for LaViDa and a generation length of 1024 for LaViDa-Reason. Dataset (split, L): MME-P (test, 100); MathVista* (testmini_format, 100); VQAv2 (val, 16); MathVerse* (testmini_vision_dominant, 100); MMBench (dev, 100); MathVision* (mathvision_testmini, 100); MMMU (val, 16); ScienceQA (scienceqa-full, 16); MME-C (test, 16); AI2D (test, 16); TextVQA (val, 16); ChartQA (test, 16); DocVQA (test, 32); InfoVQA (test, 32).
Hardware Specification | Yes | We used a mixture of A100s and A6000s for training experiments and A5000 for evaluations and inference speed benchmarks.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'torch SDPA kernel', and the 'lmms-eval package [83]', but it does not specify version numbers for these or other key software libraries (e.g., PyTorch, Python, CUDA).
Experiment Setup | Yes | Training Hyperparameters. We use AdamW optimizer with a learning rate of 5e-3 with a cosine decay schedule for all experiments. For pretraining (Stage 1), we adopted a global batch size of 256 and trained for 1 epoch. For finetuning (Stage 2), we adopted a global batch size of 512 and trained for two epochs.
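
To make the Experiment Setup entry concrete, the following is a minimal sketch of how the quoted hyperparameters (learning rate 5e-3, cosine decay, global batch sizes 256 and 512, one and two epochs) could translate into step counts and a learning-rate curve. It is not the authors' training code. The rounded Stage-1 sample count, the absence of warmup, the decay-to-zero floor, and the helper names `total_steps` and `cosine_lr` are all assumptions made for illustration; the Stage-2 sample count is simply the sum of the sizes listed for Table 5.

```python
import math

# Values quoted in the Experiment Setup entry.
PEAK_LR = 5e-3
STAGES = {
    "stage1_pretrain": {"global_batch_size": 256, "epochs": 1},
    "stage2_finetune": {"global_batch_size": 512, "epochs": 2},
}

# Sizes as listed for Table 5 (Stage-2 data sources).
TABLE5_SIZES = [
    349_860, 72_140, 10_211, 202_966, 29_765, 10_000,
    86_417, 21_953, 8_997, 80_000, 18_317, 2_988,
    72_318, 12_413, 500, 500, 500, 500,
]

# ASSUMPTION: Stage 1 size is rounded from "558K image-text pairs".
ASSUMED_NUM_SAMPLES = {
    "stage1_pretrain": 558_000,
    "stage2_finetune": sum(TABLE5_SIZES),
}


def total_steps(num_samples: int, global_batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming ceil(num_samples / batch) steps per epoch."""
    return math.ceil(num_samples / global_batch_size) * epochs


def cosine_lr(step: int, num_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at the last step.

    ASSUMPTION: no warmup and min_lr = 0; the source states neither.
    """
    progress = min(max(step / max(num_steps - 1, 1), 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for name, cfg in STAGES.items():
        steps = total_steps(ASSUMED_NUM_SAMPLES[name],
                            cfg["global_batch_size"], cfg["epochs"])
        print(f"{name}: {steps} steps, "
              f"lr start={cosine_lr(0, steps):.2e}, "
              f"lr mid={cosine_lr(steps // 2, steps):.2e}, "
              f"lr end={cosine_lr(steps - 1, steps):.2e}")
```

In an actual PyTorch run the same shape of schedule would more likely come from a built-in scheduler; the pure-Python version is used here only so the sketch has no dependencies.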