Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Authors: Ahmed Masry, Juan Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Chris Pal, Issam Hadj Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that ALIGNVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise. |
| Researcher Affiliation | Collaboration | ¹ServiceNow, ²York University, ³Mila Quebec AI Institute, ⁴École de Technologie Supérieure, ⁵Université de Montréal, ⁶McGill University, ⁷University of Waterloo, ⁸CIFAR AI Chair, ⁹Polytechnique Montréal, ¹⁰University of British Columbia |
| Pseudocode | No | The paper describes the methodology and model architecture in Section 3 and details the ALIGN module mathematically (Equations 1 and 2), but does not present the steps in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | We release our code and research artifacts at alignvlm.github.io. ... We will provide full access to our code upon the acceptance of the paper. |
| Open Datasets | Yes | We use the CC-12M dataset Changpinyo et al. [2021], a large-scale web dataset commonly used for VLM pretraining Liu et al. [2023b]... We leverage the BigDocs-7.5M dataset Rodriguez et al. [2024a], a curated collection of license-permissive datasets for multimodal document understanding. ... we further train it on the DocDownstream Rodriguez et al. [2024a], Hu et al. [2024] instruction tuning dataset. ... We conduct additional experiments using ... LLaVA-NeXT dataset Liu et al. [2024], which contains 779K samples. ... The model is trained on the LLaVA-558K image caption dataset Liu et al. [2024]. |
| Dataset Splits | No | The paper describes the datasets used for training (CC-12M, BigDocs-7.5M, DocDownstream, LLaVA-NeXT, LLaVA-558K) and the benchmarks used for evaluation (DocVQA, InfoVQA, DeepForm, KLC, WTQ, TabFact, ChartQA, TextVQA, TableVQA). While the evaluation benchmarks inherently have test splits, the paper does not explicitly specify the training/validation/test splits for the primary datasets used in its own training stages. |
| Hardware Specification | Yes | We conduct all experiments using 8 nodes of H100 GPUs, totaling 64 GPUs. |
| Software Dependencies | No | For model training, we leverage the MS-Swift framework [Zhao et al., 2024] for its flexibility. Additionally, we utilize the DeepSpeed framework [Aminabadi et al., 2022], specifically the ZeRO-3 configuration, to optimize efficient parallel training across multiple nodes. The paper mentions frameworks used but does not provide specific version numbers for any software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | Detailed hyperparameters are outlined in Appendix A.1. Table 6: Detailed hyperparameters for each training stage across different LLM backbones. Values are listed as Stage-1 / Stage-2 / Stage-3. Trainable Parameters (all backbones): Full Model / Full Model / LLM & Connector. Batch Size: Llama 3.2-1B 512 / 512 / 512; Llama 3.2-3B 512 / 256 / 256; Llama 3.1-8B 512 / 256 / 256. Text Max Length (all backbones): 1024 / 2048 / 2048. Epochs (all backbones): 1 / 1 / 5. Learning Rate: Llama 3.2-1B 1×10⁻⁵ / 5×10⁻⁵ / 5×10⁻⁵; Llama 3.2-3B 1×10⁻⁵ / 5×10⁻⁵ / 5×10⁻⁵; Llama 3.1-8B 1×10⁻⁵ / 1×10⁻⁵ / 1×10⁻⁵. |
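
For readers who want the Table 6 settings in machine-readable form, the following is a minimal sketch that encodes them as a Python structure. It is not taken from the paper's code: the names `StageConfig`, `TABLE6`, and `per_device_batch` are hypothetical, the learning-rate exponents are read as negative (the signs were lost in extraction), and the per-GPU arithmetic assumes the reported batch size is a global batch spread evenly over the 64 GPUs with no gradient accumulation, which the quoted text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One training stage from Table 6 (hypothetical container, not the authors' code)."""
    trainable: str
    batch_size: int
    text_max_length: int
    epochs: int
    learning_rate: float


def _stages(batch_sizes, learning_rates):
    # These three rows are identical for every backbone in Table 6.
    trainable = ("Full Model", "Full Model", "LLM & Connector")
    max_lengths = (1024, 2048, 2048)
    epochs = (1, 1, 5)
    return tuple(
        StageConfig(t, b, m, e, lr)
        for t, b, m, e, lr in zip(trainable, batch_sizes, max_lengths, epochs, learning_rates)
    )


# Index 0, 1, 2 correspond to Stage-1, Stage-2, Stage-3.
# Assumption: learning-rate exponents are negative (e.g. 1 x 10^-5).
TABLE6 = {
    "Llama 3.2-1B": _stages((512, 512, 512), (1e-5, 5e-5, 5e-5)),
    "Llama 3.2-3B": _stages((512, 256, 256), (1e-5, 5e-5, 5e-5)),
    "Llama 3.1-8B": _stages((512, 256, 256), (1e-5, 1e-5, 1e-5)),
}

NUM_GPUS = 64  # "8 nodes of H100 GPUs, totaling 64 GPUs"


def per_device_batch(global_batch, num_gpus=NUM_GPUS, grad_accum_steps=1):
    """Per-GPU micro-batch, assuming the reported batch size is global (an assumption)."""
    per_step = num_gpus * grad_accum_steps
    if global_batch % per_step != 0:
        raise ValueError("global batch is not divisible by num_gpus * grad_accum_steps")
    return global_batch // per_step
```

Under these assumptions, `per_device_batch(TABLE6["Llama 3.1-8B"][1].batch_size)` gives the Stage-2 micro-batch per GPU for the 8B backbone; a different gradient-accumulation setting would change that figure.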