Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation

Authors: Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, Carlo Masone

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments. Implementation details. We employ SAM2 with Hiera-Large [58] as encoder. AdaptFormer [14] is inserted into the last two blocks, with hidden size set to 0.3× the block channel dimension in the strict few-shot setting and 0.8× in the generalist. SAM2 is frozen, and only the adapters are trained (~10M params in the strict case and 25M in the generalist). We train with AdamW and learning rate 10^-4 for 5 epochs (strict) and 20 (generalist), with k=1 (a single annotated reference) and sequence length J=3. The same model is evaluated on 1-shot and 5-shot. Full details are in Appendix H. Datasets. COCO-20i [48] is built on MSCOCO [39] and consists of 80 classes split into four folds, each with 20 classes. FSS-1000 [23] contains 1000 classes, with 520 for training, 240 for validation, and 240 for testing. LVIS-92i [42] is more challenging, selecting 920 classes from LVIS [24], divided in 10 folds. PASCAL-Part [42] includes four superclasses with 56 object parts across 15 classes. PACO-Part is built from PACO [53] and contains 303 classes, split in four folds.
Researcher Affiliation | Academia | Politecnico di Torino {name.surname}@polito.it
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. The methodology is described using text and mathematical equations in sections such as "3.1 From Object Tracking to Semantic Tracking with SAM2" and "3.3 Training objective".
Open Source Code | Yes | Code at: https://github.com/ClaudiaCuttano/SANSA.
Open Datasets | Yes | Datasets. COCO-20i [48] is built on MSCOCO [39] and consists of 80 classes split into four folds, each with 20 classes. FSS-1000 [23] contains 1000 classes, with 520 for training, 240 for validation, and 240 for testing. LVIS-92i [42] is more challenging, selecting 920 classes from LVIS [24], divided in 10 folds. PASCAL-Part [42] includes four superclasses with 56 object parts across 15 classes. PACO-Part is built from PACO [53] and contains 303 classes, split in four folds.
Dataset Splits | Yes | Datasets. COCO-20i [48] is built on MSCOCO [39] and consists of 80 classes split into four folds, each with 20 classes. FSS-1000 [23] contains 1000 classes, with 520 for training, 240 for validation, and 240 for testing. LVIS-92i [42] is more challenging, selecting 920 classes from LVIS [24], divided in 10 folds. PASCAL-Part [42] includes four superclasses with 56 object parts across 15 classes. PACO-Part is built from PACO [53] and contains 303 classes, split in four folds.
Hardware Specification | Yes | Experiments conducted on an NVIDIA RTX 4090. ... FPS are measured on an NVIDIA RTX 4090. ... We train with batch size 32 on 8 A100 GPUs.
Software Dependencies | No | No specific version numbers for software dependencies (like Python, PyTorch, or CUDA) are provided. The paper mentions using "AdamW" as an optimizer, and model architectures like "Hiera-Large [58]" and "AdaptFormer [14]".
Experiment Setup | Yes | Implementation details. We employ SAM2 with Hiera-Large [58] as encoder. AdaptFormer [14] is inserted into the last two blocks, with hidden size set to 0.3× the block channel dimension in the strict few-shot setting and 0.8× in the generalist. SAM2 is frozen, and only the adapters are trained (~10M params in the strict case and 25M in the generalist). We train with AdamW and learning rate 10^-4 for 5 epochs (strict) and 20 (generalist), with k=1 (a single annotated reference) and sequence length J=3. The same model is evaluated on 1-shot and 5-shot. Full details are in Appendix H. ... We train with AdamW (learning rate 10^-4) and gradient clipping to 1.0. We train for 5 epochs in the strict setting and 20 epochs in the generalist setting, with a training sequence length of J=3. Supervision combines Binary Cross Entropy loss and Dice loss, equally weighted (1.0), applied to the predicted segmentation mask and ground truth. We train with batch size 32 on 8 A100 GPUs.
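
The Experiment Setup entry states that supervision combines Binary Cross Entropy and Dice loss with equal weights (1.0). The sketch below illustrates one common way to compute such a combined loss on a flattened mask. It is not the authors' implementation: the function names, the mean reduction, the smoothing constant `eps`, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_loss(logits, targets):
    """Mean binary cross-entropy from raw logits (stable formulation)."""
    total = 0.0
    for x, y in zip(logits, targets):
        # max(x, 0) - x*y + log(1 + exp(-|x|)) equals BCE on sigmoid(x)
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss; eps is an assumed smoothing term, not from the paper."""
    probs = [_sigmoid(x) for x in logits]
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def combined_loss(logits, targets, w_bce=1.0, w_dice=1.0):
    """Weighted sum of BCE and Dice; both weights default to 1.0 as reported."""
    return w_bce * bce_loss(logits, targets) + w_dice * dice_loss(logits, targets)


if __name__ == "__main__":
    # Hypothetical 4-pixel mask: confident, correct predictions give a small loss.
    mask = [1.0, 0.0, 1.0, 0.0]
    print(combined_loss([10.0, -10.0, 10.0, -10.0], mask))
    # Confident, wrong predictions give a much larger loss.
    print(combined_loss([-10.0, 10.0, -10.0, 10.0], mask))
```

In a real training loop the same computation would typically run on batched tensors in a deep learning framework; the report does not state which framework or version the paper used.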