Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Authors: Thanh-Dat Truong, Huu-Thien Tran, Tran Son, Bhiksha Raj, Khoa Luu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks. 4 Experiments 4.2 Comparison with State-of-the-Art Methods 4.3 Ablation Study
Researcher Affiliation	Academia	1CVIU Lab, University of Arkansas, USA 2Vietnam National University, Ho Chi Minh City University of Science, Vietnam 3Carnegie Mellon University, USA EMAIL, EMAIL, EMAIL
Pseudocode	Yes	C Pseudocode for the training procedure of our framework The Algorithm 1 elucidates the framework of our proposed Direct-LLaVA during two-stage training process.
Open Source Code	No	Justification: The code will be published may the paper be accepted.
Open Datasets	Yes	The academic task-oriented task includes five benchmarks: Visual Question Answering V2 (VQAv2) [16], Question Answering on Image Scene Graphs (GQA) [18], Answer Visual Questions from People Who Are Blind (VizWiz) [17], Science Question Answering (SciQA-IMG) [41], and Visual Reasoning based on Text in Images (TextVQA) [47]. The Instruction-Following LMM has five benchmarks: Polling-based Object Probing Evaluation for Object Hallucination (POPE) [32], Multimodal LLMs with Generative Comprehension Benchmark (SEED-Bench) [22], Comprehensive Evaluation Benchmark of LMM (MME) [59], LLaVA Benchmark in the Wild (LLaVA-Wild) [39], Integrated Capability Benchmark (MM-Vet) [62], and Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU-Val) [63].
Dataset Splits	Yes	Following the standard protocol [37], we evaluate our models on two sets of benchmarks, i.e., Academic Task-oriented and Instruction-Following LMM Benchmarks. We use the training data of LLaVA v1.5 in our experiments.
Hardware Specification	Yes	We use 32 NVIDIA A100 in our experiments.
Software Dependencies	No	Our framework adopts the implementation of LLaVA v1.5 [37]. We use the CLIP ViT-L-14 (336²) encoder for the vision tower, and four different LLMs, i.e., Vicuna 7B [10], Vicuna 13B [10], Qwen 7B [3], and LLaMA3 8B [13].
Experiment Setup	Yes	Our framework adopts the implementation of LLaVA v1.5 [37]. We use the CLIP ViT-L-14 (336²) encoder for the vision tower, and four different LLMs, i.e., Vicuna 7B [10], Vicuna 13B [10], Qwen 7B [3], and LLaMA3 8B [13]. We adopt the multi-layer perceptron [37] for the VL connector. To ensure the consistency of our implementation, the directed token drt is placed at the end of the sequence. We use 32 NVIDIA A100 in our experiments. For fair comparisons, we adopt the learning hyper-parameters of LLaVA v1.5 in our training.