Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CamSAM2: Segment Anything Accurately in Camouflaged Videos

Authors: Yuli Zhou, Yawei Li, Yuqian Fu, Luca Benini, Ender Konukoglu, Guolei Sun

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone.
Researcher Affiliation | Academia | 1 ETH Zurich; 2 Nankai University; 3 University of Zurich; 4 INSAIT, Sofia University St. Kliment Ohridski; 5 University of Bologna
Pseudocode | No | The paper describes the architecture and methodology verbally and with diagrams (Figure 1, Figure 2) but does not include a distinct section or figure labeled "Pseudocode" or "Algorithm", nor does it present structured, code-like steps for any procedure.
Open Source Code | Yes | The code is available at https://github.com/zhoustan/CamSAM2.
Open Datasets | Yes | Our experiments are conducted on three video datasets: two popular camouflaged animal datasets, MoCA-Mask [4] and CAD [19], and one camouflaged medical dataset, SUN-SEG [20].
Dataset Splits | Yes | The dataset MoCA-Mask is reorganized from the MoCA, containing 71 video sequences with 19,313 frames for training and 16 video sequences with 3,626 frames for testing, respectively, with pixel-wise ground-truth masks on every five frames. It also generates a MoCA-Mask-Pseudo dataset, which contains pseudo masks for unlabeled frames with a bidirectional optical-flow-based consistency check strategy. The Camouflaged Animal Dataset (CAD) includes 9 short videos in total that have 181 hand-labeled masks on every five frames. SUN-SEG is the largest benchmark for video polyp segmentation, derived from SUN-database [48]. It consists of a training set with 112 clips (19,544 frames) and two test sets: SUN-SEG-Easy, containing 119 clips (17,070 frames), and SUN-SEG-Hard, comprising 54 clips (12,522 frames).
Hardware Specification | Yes | We train CamSAM2 on 4 NVIDIA RTX 4090 GPUs for 10 epochs.
Software Dependencies | No | The proposed CamSAM2 is implemented with PyTorch [49]. While PyTorch is mentioned, a specific version number is not provided in the text.
Experiment Setup | Yes | Following the training strategy of SAM2, we use three types of prompts (mask, bounding box, 1-click point of foreground) for training, with the probabilities of 0.5, 0.25, and 0.25, respectively. To train the model, we use a combined loss of binary cross-entropy (BCE) and dice loss for mask predictions across the entire video. This loss applies to both SAM2's mask logits R_i and CamSAM2's mask logits R^c_i, compared with the ground-truth mask S_i of frame I_i, as follows: L = L_C + L_D, (7) ... We set betas = (0.9, 0.999) for the optimizer Adam and use the learning rate of 1e-3. We train CamSAM2 on 4 NVIDIA RTX 4090 GPUs for 10 epochs. For camouflaged animal segmentation, we train the model using the MoCA-Mask-Pseudo training set and evaluate it on the MoCA-Mask test set and CAD. During inference, we apply the 1-click, box, and mask prompts only on the first frame of each video. For camouflaged polyp segmentation, we train the model using the SUN-SEG training set and perform inference using the mask prompt on the first frame of each video on the SUN-SEG-Easy and SUN-SEG-Hard test sets. During the training process, each training video clip consists of 8 frames, the input frames are resized to 1024 × 1024, and the ground truths are resized to 256 × 256 since the raw predicted logits are 1/4 of the original size. We train CamSAM2 with a batch size of 4.
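
The Experiment Setup row above can be summarized as a small configuration sketch. The following is a minimal illustration in standard-library Python, not the authors' implementation: the names `TrainConfig`, `sample_prompt_type`, and `bce_dice_loss` are hypothetical, the loss operates on flat lists rather than tensors, and the unweighted sum of BCE and dice with a smoothing term `eps` is an assumption, since the quoted text elides the details that follow Eq. 7.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    prompt_probs: tuple = (("mask", 0.5), ("box", 0.25), ("click", 0.25))
    adam_betas: tuple = (0.9, 0.999)
    learning_rate: float = 1e-3
    epochs: int = 10
    batch_size: int = 4
    frames_per_clip: int = 8
    input_size: int = 1024   # frames are resized to 1024 x 1024
    logit_stride: int = 4    # raw predicted logits are 1/4 of the input size

    @property
    def target_size(self) -> int:
        # Ground-truth masks are resized to match the logit resolution.
        return self.input_size // self.logit_stride


def sample_prompt_type(cfg: TrainConfig, rng: random.Random) -> str:
    """Draw one training prompt type with the stated probabilities."""
    names = [name for name, _ in cfg.prompt_probs]
    weights = [weight for _, weight in cfg.prompt_probs]
    return rng.choices(names, weights=weights, k=1)[0]


def bce_dice_loss(logits, targets, eps=1.0):
    """BCE plus dice loss on flat lists of logits and 0/1 targets.

    Assumption: equal weighting and the smoothing term `eps` are illustrative
    choices; the source does not state them.
    """
    n = len(logits)
    # Numerically stable binary cross-entropy computed directly from logits.
    bce = sum(
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ) / n
    # Sigmoid written via tanh to avoid overflow for large negative logits.
    probs = [0.5 * (1.0 + math.tanh(0.5 * x)) for x in logits]
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return bce + dice
```

Under these assumptions, `TrainConfig().target_size` recovers the 256 × 256 ground-truth resolution from the 1024 × 1024 input, and the same `bce_dice_loss` would be evaluated once on SAM2's mask logits and once on CamSAM2's mask logits against the same ground-truth mask.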