Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Authors: Thanh-Dat Truong, Huu-Thien Tran, Tran Son, Bhiksha Raj, Khoa Luu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks. 4 Experiments 4.2 Comparision with State-of-the-Art Methods 4.3 Ablation Study |
| Researcher Affiliation | Academia | 1CVIU Lab, University of Arkansas, USA 2Vietnam National University, Ho Chi Minh City University of Science, Vietnam 3Carnegie Mellon University, USA EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | C Pseudocode for the training procedure of our framework The Algorithm 1 elucidates the framework of our proposed Direct-LLaVA during two-stage training process. |
| Open Source Code | No | Justification: The code will be published may the paper be accepted. |
| Open Datasets | Yes | The academic task-oriented task includes five benchmarks: Visual Question Answering V2 (VQAv2) [16], Question Answering on Image Scene Graphs (GQA) [18], Answer Visual Questions from People Who Are Blind (VizWiz) [17], Science Question Answering (SciQA-IMG) [41], and Visual Reasoning based on Text in Images (TextVQA) [47]. The Instruction-Following LMM has five benchmarks: Polling-based Object Probing Evaluation for Object Hallucination (POPE) [32], Multimodal LLMs with Generative Comprehension Benchmark (SEED-Bench) [22], Comprehensive Evaluation Benchmark of LMM (MME) [59], LLaVA Benchmark in the Wild (LLaVA-Wild) [39], Integrated Capability Benchmark (MM-Vet) [62], and Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU-Val) [63]. |
| Dataset Splits | Yes | Following the standard protocol [37], we evaluate our models on two sets of benchmarks, i.e., Academic Task-oriented and Instruction-Following LMM Benchmarks. We use the training data of LLaVA v1.5 in our experiments. |
| Hardware Specification | Yes | We use 32 NVIDIA A100 in our experiments. |
| Software Dependencies | No | Our framework adopts the implementation of LLaVA v1.5 [37]. We use the CLIP ViT-L-14 (336²) encoder for the vision tower, and four different LLMs, i.e., Vicuna 7B [10], Vicuna 13B [10], Qwen 7B [3], and LLaMA3 8B [13]. |
| Experiment Setup | Yes | Our framework adopts the implementation of LLaVA v1.5 [37]. We use the CLIP ViT-L-14 (336²) encoder for the vision tower, and four different LLMs, i.e., Vicuna 7B [10], Vicuna 13B [10], Qwen 7B [3], and LLaMA3 8B [13]. We adopt the multi-layer perception [37] for the VL connector. To ensure the consistency of our implementation, the directed token drt is placed at the end of the sequence. We use 32 NVIDIA A100 in our experiments. For fair comparisons, we adopt the learning hyper-parameters of LLaVA v1.5 in our training. |
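
The experiment setup quoted in the table can be gathered into a small configuration object, for instance when planning a reproduction. The following is a minimal Python sketch and not the authors' code: the class `DirectLLaVASetup`, its field names, the helper `place_directed_token`, and the placeholder token strings are all hypothetical. The values are only those quoted in the table; the reading of the encoder resolution as a 336x336 input is an assumption. No learning hyper-parameters are listed because the paper defers to LLaVA v1.5 for them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DirectLLaVASetup:
    """Hypothetical container for the setup quoted in the table above."""

    base_implementation: str = "LLaVA v1.5"
    # Assumption: the quoted resolution is read as a 336x336 input.
    vision_tower: str = "CLIP ViT-L-14 (336x336 input)"
    llm_backbones: Tuple[str, ...] = (
        "Vicuna 7B",
        "Vicuna 13B",
        "Qwen 7B",
        "LLaMA3 8B",
    )
    vl_connector: str = "MLP"
    num_gpus: int = 32
    gpu_type: str = "NVIDIA A100"


def place_directed_token(tokens: List[str], directed_token: str) -> List[str]:
    """Return a new sequence with the directed token appended at the end.

    This mirrors only the quoted statement that the directed token is
    placed at the end of the sequence; how the token is embedded or
    trained is not modeled here.
    """
    return [*tokens, directed_token]


setup = DirectLLaVASetup()

# Placeholder strings, not the tokens used in the paper.
sequence = place_directed_token(["<image>", "What", "is", "shown", "?"], "<directed>")
```

Keeping the setup in a frozen dataclass makes it easy to record which of the four LLM backbones a given run used without accidentally mutating the shared defaults.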