Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios
Authors: Jiacheng Ruan, Wenzhen Yuan, Zehao Lin, Ning Liao, Zhiyu Li, Feiyu Xiong, Ting Liu, Yuzhuo Fu
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on the CamObj-Bench with CamObj-Llava, 8 existing open-source and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o. |
| Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University, Shanghai, China 2 Institute for Advanced Algorithms Research, Shanghai, China |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. Methodologies are described in narrative text. |
| Open Source Code | Yes | Code: https://github.com/JCruan519/MM-CamObj |
| Open Datasets | Yes | In our study, the images in the MM-CamObj dataset are sourced from publicly available camouflage scene understanding datasets. These datasets not only provide accurate category annotations but also include segmentation masks for camouflaged objects. Specifically, as shown in Table 1, we carefully select 11,963 camouflaged target images from (Pang et al. 2023; Fan et al. 2020; Cheng et al. 2022; Yang 2023; Zheng et al. 2018). |
| Dataset Splits | Yes | Of these, 11,363 images are used to construct CamObj-Align and CamObj-Instruct, while the remaining 600 images are utilized to construct CamObj-Bench... To validate the performance of our CamObj-Llava-7B with limited data, we randomly selected 10%, 20%, and 50% of the samples from the CamObj-Align and CamObj-Instruct datasets for training, while keeping the rest of the experimental settings consistent with those described in Sec. Training Details. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A800 GPUs. |
| Software Dependencies | Yes | We utilize BGE-M3 (Chen et al. 2023) and BGE-v1.5-en (Xiao et al. 2023) to obtain embeddings for the image and text, calculating the cosine similarity between them. |
| Experiment Setup | Yes | During the alignment stage, we set the learning rate to 5e-4, and for the instruction fine-tuning stage, the learning rate is adjusted to 2.5e-4. Both stages are trained for 1 epoch using the AdamW (Loshchilov and Hutter 2017) optimizer, with a cosine decay strategy (Loshchilov and Hutter 2016) to dynamically adjust the learning rate. Specifically, we apply LoRA (Hu et al. 2021) modules to each linear layer of the large language model, with the rank r and scaling factor α set to 128 and 256. |
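The Software Dependencies row quotes a pipeline that embeds image and text with BGE-M3 and BGE-v1.5-en and scores them by cosine similarity. A minimal sketch of that similarity step, using random placeholder vectors in place of actual BGE model outputs (the helper name `cosine_similarity` and the 1024-dim size are illustrative assumptions, not from the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for BGE-M3 / BGE-v1.5-en embeddings.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=1024)
txt_emb = rng.normal(size=1024)

score = cosine_similarity(img_emb, txt_emb)
assert -1.0 <= score <= 1.0
```

In the actual pipeline the two vectors would come from the respective BGE encoders; the similarity computation itself is model-agnostic.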
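The training recipe in the Experiment Setup row (per-stage base learning rates, one epoch with AdamW, cosine decay, LoRA with r=128 and α=256) can be sketched numerically. The schedule below assumes the standard no-restart, no-warmup form of cosine annealing from Loshchilov and Hutter (2016) decaying to zero; the function name and step counts are illustrative, not taken from the paper:

```python
import math

# Stage-specific base learning rates reported in the paper.
ALIGN_LR = 5e-4       # alignment stage
INSTRUCT_LR = 2.5e-4  # instruction fine-tuning stage

# Reported LoRA hyperparameters: effective scale is alpha / r = 2.0.
LORA_R, LORA_ALPHA = 128, 256

def cosine_decay_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine-annealed learning rate: base_lr at step 0, 0 at total_steps."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# The schedule starts at the base rate and halves it at the midpoint.
assert cosine_decay_lr(0, 1000, ALIGN_LR) == ALIGN_LR
assert abs(cosine_decay_lr(500, 1000, ALIGN_LR) - ALIGN_LR / 2) < 1e-12
```

In practice this schedule would be driven by the optimizer wrapper (e.g. a per-step LR scheduler) rather than computed by hand; the sketch only makes the reported decay strategy concrete.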