Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

Authors: Ye Sun, Hao Helen Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.
Researcher Affiliation | Academia | Ye Sun1, Hao Zhang2, Henghui Ding1, Tiehua Zhang3, Xingjun Ma1, Yu-Gang Jiang1; 1Fudan University, 2HKUST, 3Tongji University
Pseudocode | No | The paper includes a model architecture diagram (Figure 3) but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/sunye23/SAMA
Open Datasets | Yes | SAMA is trained on a diverse set of image/video QA and referring segmentation/grounding datasets, including LLaVA-1.5 665K [38], the ChatUniVi [21] dataset, image-based referring/grounding data (refCOCO/+/g [23, 43], GRand-F [49]), and video referring segmentation datasets (Ref-YouTube-VOS [53], MeViS [9], and ReVOS [61]). Critically, training is enhanced with our proposed SAMA-239K dataset, comprising 239K instances of referential grounded dialogue and object-level descriptions.
Dataset Splits | Yes | SAMA-Bench comprises 5,067 questions synthesized from 522 videos across four public validation datasets: MeViS [9], Ref-YouTube-VOS [53], LV-VIS [58], and VidSTG [74]... As shown in Table 6, SAMA-Bench G consists of 3,038 video referential grounded chat questions, with 244 from MeViS, 756 from Ref-YouTube-VOS, 1,019 from LV-VIS, and 1,019 from VidSTG. Similarly, Table 7 summarizes SAMA-Bench C, which comprises 2,031 video referential captioning questions, sourced from the same set of videos: 117 from MeViS, 350 from Ref-YouTube-VOS, 589 from LV-VIS, and 975 from VidSTG.
Hardware Specification | Yes | Model training and inference is conducted on 8 NVIDIA A100 GPUs (80GB).
Software Dependencies | No | We implement SAMA leveraging the XTuner [7] codebase. The paper does not provide specific version numbers for XTuner or other key software components like Python or PyTorch.
Experiment Setup | Yes | During the instruction tuning phase, we make the parameters of our spatial-temporal-context aggregator and the decoder of the SAM2 model [50] trainable to learn spatial-temporal information and inject referential grounded video chat capability into the base model. The initial learning rate is set to 4e-5. The maximum sequence length for the LLM is configured to 8,192.
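The per-source question counts quoted in the Dataset Splits row can be tallied with a short script. This is only a sketch for checking the quoted figures: the dictionary and function names are illustrative and do not come from the SAMA codebase, and the numbers are copied from the row above rather than read from the benchmark files. Both subset totals match the quoted figures (3,038 and 2,031); their sum is 5,069, two more than the quoted overall total of 5,067, and the sketch reports this difference without attempting to explain it.

```python
from typing import Dict

# Per-source question counts, copied from the Dataset Splits row above.
# The variable names are illustrative, not taken from the SAMA codebase.
GROUNDED_CHAT: Dict[str, int] = {
    "MeViS": 244,
    "Ref-YouTube-VOS": 756,
    "LV-VIS": 1019,
    "VidSTG": 1019,
}
CAPTIONING: Dict[str, int] = {
    "MeViS": 117,
    "Ref-YouTube-VOS": 350,
    "LV-VIS": 589,
    "VidSTG": 975,
}

# Overall total as quoted in the same row.
QUOTED_OVERALL_TOTAL = 5067


def subset_total(counts: Dict[str, int]) -> int:
    """Sum the per-source question counts of one benchmark subset."""
    return sum(counts.values())


grounded_total = subset_total(GROUNDED_CHAT)
captioning_total = subset_total(CAPTIONING)
combined_total = grounded_total + captioning_total
difference = combined_total - QUOTED_OVERALL_TOTAL

if __name__ == "__main__":
    print(f"grounded chat questions: {grounded_total}")
    print(f"captioning questions:    {captioning_total}")
    print(f"combined:                {combined_total}")
    print(f"difference from quoted overall total: {difference}")
```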