Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Hawaii: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models
Authors: Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs. The results are shown in Table 1. Compared to the baseline method (LLaVA-1.5), HAWAII achieves significant improvements on most benchmarks, demonstrating its effectiveness. In this part, we conduct ablation studies to analyze the effectiveness of the proposed components in HAWAII. |
| Researcher Affiliation | Academia | Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki University of Waterloo, Canada EMAIL |
| Pseudocode | No | The paper describes the methods textually and visually through Figure 1 but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | The code is available at https://github.com/yimuwangcs/wise-hawaii. |
| Open Datasets | Yes | We evaluate HAWAII on several image understanding tasks [36, 37, 38, 39, 40, 41, 42, 43, 44, 45]. Details are deferred to the Appendix. MME [41], MMBench [42], SeedBench [45], GQA [38], SQA [39], MMMU [43], POPE [40], AI2D [44], VizWiz [36], and TextVQA [37]. |
| Dataset Splits | Yes | HAWAII uses the same training data as LLaVA-v1.5 [46]. Specifically, in the pretraining stage, we use 558K image-text pairs, while in the supervised fine-tuning stage, we use 665K instruction-following image-text data to boost the performance. |
| Hardware Specification | Yes | Models are run on eight NVIDIA A6000 GPUs with 48GB of memory. |
| Software Dependencies | No | The paper mentions specific versions of models used like 'Vicuna-v1.5-7B', 'CLIP', 'ConvNeXt', 'Pix2Struct', 'SAM', and 'EVA-02'. However, it does not provide specific version numbers for general software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch), or CUDA libraries, which are typically required for full reproducibility. |
| Experiment Setup | Yes | We use Vicuna-v1.5-7B [3] as the LLM and use CLIP [1] for the vision encoder, with the teachers of CLIP (as CLIP is updated) being ConvNeXt [34], Pix2Struct [35], SAM [6], and EVA-02 [8]. The visual projector is a 2-layer MLP with the GELU activation function [56]. For MOLA, we use three (or four) teacher-specific LoRA adapters and three general-knowledge LoRA adapters for each FFN layer of the student encoder. Each adapter is a LoRA block [30] with rank of 32. The routers are sparse and 2-layer MLPs with the GELU activation function. Each router selects only the LoRA adapter with the highest probability. $L = L_{gen} + \lambda_1 (L_{fg} + L_{cg}) + \lambda_2 L_{mb}$, where $\lambda_1$ and $\lambda_2$ are the hyper-parameters to balance the losses. We set $\lambda_1 = 0.5$ and $\lambda_2 = 0.05$ for all our experiments. The training of HAWAII consists of two stages, i.e., pretraining and fine-tuning. |
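
The Experiment Setup row quotes a weighted loss and a top-1 routing rule. As a rough illustration only (not the authors' implementation, which is in the linked repository), the two pieces can be sketched in plain Python. The function names `total_loss`, `softmax`, and `select_adapter` are hypothetical, the four loss terms are treated as already-computed floats, and only the weights 0.5 and 0.05 and the "highest probability" selection rule come from the quoted text.

```python
import math
from typing import Sequence

# Loss weights quoted in the Experiment Setup row.
LAMBDA_1 = 0.5
LAMBDA_2 = 0.05


def total_loss(l_gen: float, l_fg: float, l_cg: float, l_mb: float,
               lambda_1: float = LAMBDA_1, lambda_2: float = LAMBDA_2) -> float:
    """Combine the terms as L = L_gen + lambda_1 * (L_fg + L_cg) + lambda_2 * L_mb.

    The individual terms are assumed to be computed elsewhere; this
    function only applies the weighting.
    """
    return l_gen + lambda_1 * (l_fg + l_cg) + lambda_2 * l_mb


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def select_adapter(router_logits: Sequence[float]) -> int:
    """Return the index of the adapter with the highest router probability.

    Hypothetical helper: it mirrors the stated top-1 rule, not the
    paper's actual router (a 2-layer MLP with GELU).
    """
    probs = softmax(router_logits)
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    # Made-up loss values and logits, purely for illustration.
    print(total_loss(l_gen=1.0, l_fg=0.2, l_cg=0.4, l_mb=2.0))
    print(select_adapter([0.1, 2.0, -1.0]))
```

Because softmax is monotonic, taking the argmax of the raw logits would select the same adapter; the probabilities are computed here only to match the wording of the quoted setup.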