Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Authors: Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, You He

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results on FINERS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.
Researcher Affiliation	Academia	1Dalian University of Technology, Dalian, China 2University of Electronic Science and Technology of China, Chengdu, China 3Tsinghua Shenzhen International Graduate School, Shenzhen, China
Pseudocode	No	The paper describes the methodology using textual explanations and figures, but no explicitly labeled pseudocode or algorithm blocks are provided.
Open Source Code	No	We have included the implementation details and will provide the code, models, and dataset once publication.
Open Datasets	Yes	Extensive experiments on FINERS-4k and other public datasets [9, 10] demonstrate that FINERS consistently outperforms existing MLLM-based methods in both answer accuracy and segmentation precision. ... We also conduct a comparison with other approaches on public high-resolution VQA datasets, including V* [9] and HR-Bench [10].
Dataset Splits	Yes	This process results in 8,411 annotated small entities across 4,563 high-resolution images, yielding a total of 12,132 text-mask pairs. Specifically, we divide them into train set (8,956), validation set (749), and test set (2,427).
Hardware Specification	Yes	The whole model is trained on a 4 A800 GPU (80G) setup using the Seg-Zero [16] and Deep Speed [12] library. ... All models were evaluated on a single A100 GPU with consistent runtime environments.
Software Dependencies	Yes	Our two-stage MLLMs are built upon Qwen2.5-VL-7B [2]... In addition, we adopt SAM2 [8] for box-to-mask generation, which is kept frozen during training.
Experiment Setup	Yes	The GSE module uses a total batch size of 16 with 8 samples per training step, while the LPR module uses a total batch size of 32, also with 8 samples per step. For both stages, the initial learning rate is set to 1e-6 and the weight decay is 0.01.
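
The split sizes quoted in the Dataset Splits row can be checked against the reported total of 12,132 text-mask pairs. The snippet below is a minimal sketch of that arithmetic; the names `SPLITS`, `REPORTED_TOTAL`, and `split_fractions` are illustrative and do not come from the paper or its code.

```python
# Split sizes (text-mask pairs) as quoted in the Dataset Splits row.
# The variable and function names here are hypothetical, chosen for illustration.
SPLITS = {"train": 8956, "validation": 749, "test": 2427}
REPORTED_TOTAL = 12132  # total text-mask pairs quoted in the same row


def split_fractions(splits):
    """Return each split's share of the summed split sizes."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


total = sum(SPLITS.values())
fractions = split_fractions(SPLITS)

# The three splits should account for every reported text-mask pair.
assert total == REPORTED_TOTAL

for name, frac in fractions.items():
    print(f"{name}: {SPLITS[name]} pairs ({frac:.1%})")
```

The three splits sum exactly to the reported total, and the proportions come out at roughly 74% train, 6% validation, and 20% test.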