Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation

Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim

ICLR 2025

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement. Experimental results on MMR and other benchmarks show that M2SA outperforms state-of-the-art methods, validating the effectiveness of its components.

Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim; Department of Electrical Engineering, KAIST.

Reproducibility Variable: Pseudocode
Result: No
LLM Response: The paper describes the M2SA framework and its model architecture in Section 4 and Figure 4, but it does not contain explicit pseudocode or algorithm blocks.

Reproducibility Variable: Open Source Code
Result: No
LLM Response: The abstract states that "The dataset is available at https://github.com/jdg900/MMR." Section A.3 mentions: "We utilize the released code from LISA (Lai et al., 2023) for the baseline model code construction. Since LISA follows Apache License 2.0, our code is also licensed under Apache License 2.0." However, there is no explicit statement or direct link for open-source code of the M2SA model itself, only for the dataset.

Reproducibility Variable: Open Datasets
Result: Yes
LLM Response: The dataset is available at https://github.com/jdg900/MMR. We collect image and mask annotations from the publicly available PACO-LVIS dataset (Ramanathan et al., 2023).

Reproducibility Variable: Dataset Splits
Result: Yes
LLM Response: The entire dataset is split into distinct sets for training (154,127 pairs), validation (8,194 pairs), and test (32,077 pairs). Moreover, the test set is further categorized into three subsets: object-only, part-only, and mixed sets.

Reproducibility Variable: Hardware Specification
Result: Yes
LLM Response: All experiments are conducted using 4 NVIDIA RTX A6000 GPUs.

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using pre-trained models such as LLaVA-7B, LLaVA-Llama2-13B, CLIP-ViT-L/14, Vicuna-7B, Llama2-13B, and SAM-ViT-H, along with LoRA and the AdamW optimizer, but it does not provide version numbers for these components or for underlying frameworks such as PyTorch or CUDA.

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: We use pre-trained LLaVA-7B ... and LLaVA-Llama2-13B with CLIP-ViT-L/14 ... and Vicuna-7B .../Llama2-13B ... to form the Multimodal Large Language Model (MLLM). We adopt the pre-trained SAM-ViT-H ... for the segmentation model. ... Our model is trained for 10 epochs, with each epoch consisting of 5,000 steps. We employ the AdamW ... optimizer with a learning rate of 0.0003 and set gradient accumulation to 10 steps per update. Additionally, we use Warmup Decay LR as the learning rate scheduler. The learning rate is linearly decayed after 100 steps. The batch size and LoRA rank are set to 2 and 8, respectively. ... We sample the data from the mixed training dataset in a ratio of 2:9:2:6...
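The learning-rate schedule quoted above (linear warmup, then linear decay after 100 steps, over 10 epochs of 5,000 steps each, i.e. 50,000 optimizer steps) can be sketched as follows. This is a minimal illustration, not the authors' code: the decay-to-zero endpoint and the exact warmup formula are assumptions, since the excerpt does not spell them out.

```python
def warmup_decay_lr(step, base_lr=3e-4, warmup_steps=100, total_steps=50_000):
    """Hypothetical Warmup Decay LR schedule: linear warmup over the first
    `warmup_steps`, then linear decay after that. The hyperparameters
    (base_lr=0.0003, 100 warmup steps, 10 epochs x 5,000 steps) follow the
    setup quoted above; decaying to zero at `total_steps` is an assumption.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Effective batch size implied by the quoted setup, assuming batch size 2 is
# per GPU: 2 (batch) * 10 (gradient accumulation) * 4 (GPUs) = 80 per update.
```

In practice an equivalent schedule would be wired into the AdamW optimizer via a per-step scheduler callback (e.g. a lambda-style LR scheduler in the training framework).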