Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning
Authors: Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, You He
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on FINERS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks. |
| Researcher Affiliation | Academia | 1Dalian University of Technology, Dalian, China 2University of Electronic Science and Technology of China, Chengdu, China 3Tsinghua Shenzhen International Graduate School, Shenzhen, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures, but no explicitly labeled pseudocode or algorithm blocks are provided. |
| Open Source Code | No | We have included the implementation details and will provide the code, models, and dataset once publication. |
| Open Datasets | Yes | Extensive experiments on FINERS-4k and other public datasets [9, 10] demonstrate that FINERS consistently outperforms existing MLLM-based methods in both answer accuracy and segmentation precision. ... We also conduct a comparison with other approaches on public high-resolution VQA datasets, including V* [9] and HR-Bench [10]. |
| Dataset Splits | Yes | This process results in 8,411 annotated small entities across 4,563 high-resolution images, yielding a total of 12,132 text-mask pairs. Specifically, we divide them into train set (8,956), validation set (749), and test set (2,427). |
| Hardware Specification | Yes | The whole model is trained on a 4 A800 GPU (80G) setup using the Seg-Zero [16] and Deep Speed [12] library. ... All models were evaluated on a single A100 GPU with consistent runtime environments. |
| Software Dependencies | Yes | Our two-stage MLLMs are built upon Qwen2.5-VL-7B [2]... In addition, we adopt SAM2 [8] for box-to-mask generation, which is kept frozen during training. |
| Experiment Setup | Yes | The GSE module uses a total batch size of 16 with 8 samples per training step, while the LPR module uses a total batch size of 32, also with 8 samples per step. For both stages, the initial learning rate is set to 1e-6 and the weight decay is 0.01. |
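
The split sizes quoted in the Dataset Splits row can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the numbers are copied from the quoted text rather than computed from the dataset itself. It confirms that the three splits add up to the reported 12,132 text-mask pairs and prints each split's share.

```python
# Split sizes as quoted in the Dataset Splits row (not read from the dataset).
splits = {"train": 8956, "validation": 749, "test": 2427}
reported_total = 12132  # text-mask pairs reported in the same row

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

# Print each split with its share of the total.
for name, count in splits.items():
    print(f"{name:<10} {count:>6}  {fractions[name]:.1%}")

print(f"sum = {total}, matches reported total: {total == reported_total}")
```

The same check applies to any scorecard row that reports split counts alongside a total.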