Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
Authors: Wei Xu, Cheng Wang, Dingkang Liang, Zongchuang Zhao, Xingyu Jiang, Peng Zhang, Xiang Bai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. |
| Researcher Affiliation | Academia | 1 Huazhong University of Science and Technology; 2 National University of Defense Technology |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2, 3, 4) and describes methodologies using text and mathematical equations, but it does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and models are available at https://github.com/H-EmbodVis/NAUTILUS. |
| Open Datasets | Yes | To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. ... Data and models are available at https://github.com/H-EmbodVis/NAUTILUS. |
| Dataset Splits | Yes | The current NautData test set comprises 3,920 images paired with 7,916 question-answering (QA) examples. ... We evaluate the counting performance on its test set. ... Models in Tab. 5 and Tab. 6 are trained on one-third of the NautData training set and evaluated on the full NautData test set. |
| Hardware Specification | Yes | Our experiments are conducted on four NVIDIA A800-80GB GPUs, training each model for one epoch, taking around 3 days. |
| Software Dependencies | No | The paper mentions using LLaVA-1.5 and Qwen2.5-VL as baselines and their official repositories, but it does not provide specific version numbers for underlying software libraries like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For both of them, we adopt a parameter-efficient fine-tuning (PEFT) strategy [53, 29, 22], and the trainable components are the vision-to-language projector, LoRA [22], and the vision feature enhancement module. In our instruction tuning, we preserve the default hyperparameters of LLaVA-1.5 to pursue optimal performance and ensure a fair comparison with its original implementations. As for Qwen2.5-VL, since the official repository only supports full fine-tuning, we reproduce LoRA fine-tuning, setting the learning rate as $2 \times 10^{-5}$ with the batch size as 16. The LoRA ranks in LLaVA-1.5 and Qwen2.5-VL are set as 128. |
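
The Qwen2.5-VL settings quoted in the Hardware Specification and Experiment Setup rows can be collected into a small configuration record. The following is a minimal sketch, not the authors' code: the class name `LoraFinetuneSetup` and its field names are hypothetical, the learning rate is read as 2e-5 (the extracted text is garbled at that point), and the batch size of 16 is assumed to be the global batch size across all four GPUs, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraFinetuneSetup:
    """Hypothetical container for the reported Qwen2.5-VL LoRA settings."""

    learning_rate: float = 2e-5   # assumption: garbled "2 10 5" read as 2e-5
    global_batch_size: int = 16   # assumption: 16 is the global batch size
    lora_rank: int = 128          # reported for both LLaVA-1.5 and Qwen2.5-VL
    num_gpus: int = 4             # four NVIDIA A800-80GB GPUs
    num_epochs: int = 1           # each model is trained for one epoch

    def per_device_batch_size(self, grad_accum_steps: int = 1) -> int:
        """Split the global batch evenly over GPUs and accumulation steps."""
        denom = self.num_gpus * grad_accum_steps
        if grad_accum_steps < 1 or self.global_batch_size % denom != 0:
            raise ValueError(
                "global batch size must divide evenly across "
                "GPUs and gradient accumulation steps"
            )
        return self.global_batch_size // denom


setup = LoraFinetuneSetup()
print(setup.per_device_batch_size())
```

Under these assumptions each GPU processes four samples per step when no gradient accumulation is used. If 16 is instead the per-device batch size, the record would need a different field and the split above would not apply.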