Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Authors: Jiacheng Ruan, Wenzhen Yuan, Zehao Lin, Ning Liao, Zhiyu Li, Feiyu Xiong, Ting Liu, Yuzhuo Fu

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments are conducted on the CamObj-Bench with CamObj-LLaVA, 8 existing open-source and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o.
Researcher Affiliation Academia 1) Shanghai Jiao Tong University, Shanghai, China; 2) Institute for Advanced Algorithms Research, Shanghai, China
Pseudocode No No structured pseudocode or algorithm blocks are present in the paper. Methodologies are described in narrative text.
Open Source Code Yes Code: https://github.com/JCruan519/MM-CamObj
Open Datasets Yes In our study, the images in the MM-CamObj dataset are sourced from publicly available camouflage scene understanding datasets. These datasets not only provide accurate category annotations but also include segmentation masks for camouflaged objects. Specifically, as shown in Table 1, we carefully select 11,963 camouflaged target images from (Pang et al. 2023; Fan et al. 2020; Cheng et al. 2022; Yang 2023; Zheng et al. 2018).
Dataset Splits Yes Of these, 11,363 images are used to construct CamObj-Align and CamObj-Instruct, while the remaining 600 images are utilized to construct CamObj-Bench... To validate the performance of our CamObj-LLaVA-7B with limited data, we randomly selected 10%, 20%, and 50% of the samples from the CamObj-Align and CamObj-Instruct datasets for training, while keeping the rest of the experimental settings consistent with those described in Sec. Training Details.
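The limited-data ablation above (training on random 10%, 20%, and 50% subsets) can be sketched as follows. This is a minimal illustration, not the authors' code: the sample IDs and the fixed seed are assumptions for reproducibility of the sketch itself.

```python
import random

def sample_subset(samples, fraction, seed=42):
    """Draw a random fraction of the training samples without replacement."""
    rng = random.Random(seed)  # fixed seed is illustrative, not from the paper
    k = int(len(samples) * fraction)
    return rng.sample(samples, k)

# Hypothetical IDs standing in for the 11,363 CamObj-Align/CamObj-Instruct samples.
dataset = [f"sample_{i}" for i in range(11363)]
subsets = {frac: sample_subset(dataset, frac) for frac in (0.1, 0.2, 0.5)}
for frac, subset in subsets.items():
    print(frac, len(subset))
```

Each subset is then used for training with all other settings unchanged, matching the ablation protocol quoted above.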
Hardware Specification Yes All experiments are conducted on 8 NVIDIA A800 GPUs.
Software Dependencies Yes We utilize BGE-M3 (Chen et al. 2023) and BGE-v1.5-en (Xiao et al. 2023) to obtain embeddings for the image and text, calculating the cosine similarity between them.
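The image-text matching step above reduces to a cosine similarity between two embedding vectors. A minimal sketch follows; the toy vectors are placeholders, whereas in the paper the embeddings come from BGE-M3 and BGE-v1.5-en.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings standing in for BGE image/text embeddings.
image_emb = [0.2, 0.8, 0.1]
text_emb = [0.3, 0.7, 0.0]
print(cosine_similarity(image_emb, text_emb))
```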
Experiment Setup Yes During the alignment stage, we set the learning rate to 5e-4, and for the instruction fine-tuning stage, the learning rate is adjusted to 2.5e-4. Both stages are trained for 1 epoch using the AdamW (Loshchilov and Hutter 2017) optimizer, with a cosine decay strategy (Loshchilov and Hutter 2016) to dynamically adjust the learning rate. Specifically, we apply LoRA (Hu et al. 2021) modules to each linear layer of the large language model, with the rank r and scaling factor α set to 128 and 256, respectively.
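The cosine decay schedule referenced above can be written in closed form. This is a sketch under stated assumptions: the function name, step granularity, and a minimum learning rate of 0 are illustrative choices, not details from the paper.

```python
import math

def cosine_decay_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate (Loshchilov and Hutter 2016)."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Alignment stage starts at 5e-4; instruction fine-tuning starts at 2.5e-4.
print(cosine_decay_lr(0, 1000, 5e-4))     # full base learning rate at step 0
print(cosine_decay_lr(1000, 1000, 5e-4))  # fully decayed at the final step
```

The rate starts at `base_lr`, reaches half of it at the schedule midpoint, and decays smoothly to `min_lr` by the end of the single training epoch.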