Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation

Authors: Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our experiments in Sec. 5 illustrate the effectiveness of our approach on various benchmarks while outperforming the baselines.
Researcher Affiliation Collaboration (1) SHI Labs @ Georgia Tech; (2) Microsoft Research, Redmond; (3) Meta Superintelligence Labs
Pseudocode No The paper includes architectural diagrams (Figure 1, Figure 4) and textual descriptions of methods, but no explicit section or figure labeled 'Pseudocode' or 'Algorithm' with structured, code-like steps.
Open Source Code No Answer: [Yes] for data; [No] for code. Justification: We use publicly available datasets which can be easily downloaded from the web. Our codebase is built on top of the LLaVA-1.5 [45] codebase, and we will release our code with the camera-ready version.
Open Datasets Yes Answer: [Yes] for data; [No] for code. Justification: We use publicly available datasets which can be easily downloaded from the web. During the PT stage, we use the LLaVA-558K [45] dataset to train our model... During the IFT stage, we use the LLaVA-665K [45] dataset... When using VPT, we leverage the ALLaVA-Caption-663K [6] dataset... We train the probes for two epochs on the 118k images from the COCO-train2017 [43] set with the text query: "Describe the image in two lines." We find a positive correlation of 0.98 between the depth probing performance and CV-Bench accuracy in Tab. X, proving the effectiveness of our probing setup. We compute the cosine similarity between the probe outputs and the corresponding target features over the 5k images from the COCO-val2017 [43] set to get the probing performance for evaluation.
Dataset Splits Yes We train the probes for two epochs on the 118k images from the COCO-train2017 [43] set... We compute the cosine similarity between the probe outputs and the corresponding target features over the 5k images from the COCO-val2017 [43] set... During the PT stage, we use the LLaVA-558K [45] dataset... During the IFT stage, we use the LLaVA-665K [45] dataset... When using VPT, we leverage the ALLaVA-Caption-663K [6] dataset...
Hardware Specification Yes We train all our models on 16 AMD 192G-MI300X GPUs with a batch size of 256 during PT and 128 during IFT and VPT. We record the throughput on a single NVIDIA 80G A100 GPU for a single forward pass on the CV-Bench evaluation set with a batch size of 1. We report the mean and standard deviation across 10 runs. We use the CLIP-ConvNeXT-XXL [14] and Llama3-8b [64] based models for the throughput analysis.
Software Dependencies No Our codebase is built on top of the LLaVA-1.5 [45] codebase, and we will release our code with the camera-ready version. The paper mentions several models and frameworks like Llama3-8b [64], CLIP-ViT-L [55], DINOv2-L [53], Swin-L [49], unCLIP-SD-2.1 [57], but does not provide specific version numbers for ancillary software like Python, PyTorch, or CUDA.
Experiment Setup Yes During the PT stage, we use the LLaVA-558K [45] dataset to train our model for an epoch with lr of 1e-3. We only train the (MLP) projector, the embedding predictors, and the special tokens (t). During the IFT stage, we use the LLaVA-665K [45] dataset and train the projector and LLM for one epoch with an lr of 2e-5 with the vision encoder and t kept frozen. When using VPT, we leverage the ALLaVA-Caption-663K [6] dataset to train the whole model (except t) for one epoch with an lr of 2e-5. We train all our models on 16 AMD 192G-MI300X GPUs with a batch size of 256 during PT and 128 during IFT and VPT. We use CLIP-ViT-L [55] and Llama3-8b [64] as the base vision encoder and decoder LLM unless mentioned otherwise. By default, we set N, D, S, and G to 8, {8, 20}, {10, 18}, and {12, 20}, respectively. For other hyperparameters, we follow LLaVA-1.5 [45].
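
The training stages quoted in the Experiment Setup row can be summarized as a small configuration sketch. This is an illustration only, not the authors' code: the `Stage` class, the field names, and the component labels are hypothetical, and the per-GPU batch size assumes plain data parallelism across the 16 GPUs with no gradient accumulation, which the quoted text does not state. The tuple order follows the order in which the stages are mentioned in the quote, not necessarily the order in which they are run.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage as described in the quoted setup (hypothetical structure)."""
    name: str
    dataset: str
    learning_rate: float
    global_batch_size: int
    trainable: Tuple[str, ...]
    frozen: Tuple[str, ...] = ()  # only parts the quote explicitly excludes
    epochs: int = 1               # every stage is quoted as one epoch


NUM_GPUS = 16  # "16 AMD 192G-MI300X GPUs" in the quoted text

STAGES = (
    Stage("PT", "LLaVA-558K", 1e-3, 256,
          trainable=("projector", "embedding_predictors", "special_tokens")),
    Stage("IFT", "LLaVA-665K", 2e-5, 128,
          trainable=("projector", "llm"),
          frozen=("vision_encoder", "special_tokens")),
    Stage("VPT", "ALLaVA-Caption-663K", 2e-5, 128,
          trainable=("whole_model",),
          frozen=("special_tokens",)),
)


def per_gpu_batch_size(stage: Stage, num_gpus: int = NUM_GPUS) -> int:
    """Split the global batch evenly across GPUs.

    Assumption (not in the source): pure data parallelism, no gradient
    accumulation, so per-GPU batch = global batch / number of GPUs.
    """
    if stage.global_batch_size % num_gpus != 0:
        raise ValueError("global batch size is not divisible by the GPU count")
    return stage.global_batch_size // num_gpus


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: dataset={stage.dataset}, lr={stage.learning_rate:g}, "
              f"global batch={stage.global_batch_size}, "
              f"per-GPU batch={per_gpu_batch_size(stage)}")
```

Writing the stages down this way makes the gaps visible: the quote gives learning rates, batch sizes, and epochs, but optimizer, scheduler, and warmup settings are deferred to LLaVA-1.5 [45] and would need to be taken from that codebase.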