Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Authors: Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Yang Zheng, Jingyuan Li, Eli Shlizerman

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.
Researcher Affiliation	Academia	Department of Electrical & Computer Engineering, University of Washington, Seattle, USA. Department of Applied Mathematics, University of Washington, Seattle, USA. Corresponding author: EMAIL
Pseudocode	Yes	Algorithm 1 Track Aggregation Algorithm for Global Map Construction
Open Source Code	No	Code and benchmark dataset will be released upon request from the corresponding author. We also aim to release the code and the benchmark dataset as a public repository on GitHub upon obtaining involved approvals.
Open Datasets	No	Code and benchmark dataset will be released upon request from the corresponding author. We also aim to release the code and the benchmark dataset as a public repository on GitHub upon obtaining involved approvals.
Dataset Splits	No	The paper describes SAVVY-Bench as a benchmark for evaluation of 3D spatial reasoning of AV-LLMs. It details the task taxonomy and statistics of the QA pairs within the benchmark, such as the distribution of QA task types (Figure 2a). However, it does not explicitly provide information on train/test/validation splits of the SAVVY-Bench dataset itself for model evaluation or training purposes within the paper.
Hardware Specification	Yes	For baseline AV-LLMs (7B scale), we use a single A100 (40GB) to perform experiments. For Gemini 2.5, we use Google Cloud Platform's API to perform experiments.
Software Dependencies	No	The paper mentions several software tools and modules such as LMMs-Eval [71], CLIPSeg [63], SAM model [64], and PyQt5 for the annotation tool. However, it does not provide specific version numbers for most of these key software components, except for 'PyQt5'.
Experiment Setup	Yes	We use greedy decoding with temperature set to 0, and both top-p and top-k set to 1. Following [71], we sample 32 video frames uniformly across the entire video duration. For audio, we average multiple channels to produce a compressed monaural input, with a sampling rate of 16 kHz. We uniformly sample 128 frames from each video. A detection is considered valid if the average score exceeds a threshold: 0.5 for dynamic sounding objects and 0.6 for reference and facing objects. We process spatial audio signals at 0.25 s per segment, with a sampling rate of 48 kHz. The analysis is constrained to the 500-2000 Hz frequency band for speech-related audio cues.
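
Because the code has not been released, the preprocessing steps quoted in the Experiment Setup row can only be illustrated, not reproduced. The following is a minimal Python sketch of four of them: uniform frame sampling, mono downmixing, fixed-length audio segmentation, and score thresholding. All function names are hypothetical, and the midpoint-of-bin sampling rule is an assumption, since the quoted text only says frames are sampled "uniformly". The 500-2000 Hz band restriction is not shown.

```python
from __future__ import annotations


def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the video.

    Assumption: the midpoint of each equal-width bin is used.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def downmix_to_mono(channels: list[list[float]]) -> list[float]:
    """Average equal-length audio channels sample by sample."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def segment_bounds(
    num_samples: int, sample_rate: int = 48_000, seg_seconds: float = 0.25
) -> list[tuple[int, int]]:
    """Return (start, end) sample offsets for consecutive fixed-length segments."""
    seg_len = int(sample_rate * seg_seconds)  # 12000 samples at 48 kHz
    return [
        (start, min(start + seg_len, num_samples))
        for start in range(0, num_samples, seg_len)
    ]


def is_valid_detection(scores: list[float], threshold: float) -> bool:
    """A detection is kept only if its average score exceeds the threshold."""
    if not scores:
        return False
    return sum(scores) / len(scores) > threshold


# Thresholds as quoted in the table row above.
DYNAMIC_OBJECT_THRESHOLD = 0.5
REFERENCE_OBJECT_THRESHOLD = 0.6

if __name__ == "__main__":
    print(uniform_frame_indices(320, 32)[:3])
    print(downmix_to_mono([[1.0, 3.0], [3.0, 5.0]]))
    print(len(segment_bounds(48_000)))
    print(is_valid_detection([0.4, 0.7], DYNAMIC_OBJECT_THRESHOLD))
```

With these definitions, one second of 48 kHz audio splits into four 0.25 s segments, and a track with mean score 0.55 would pass the dynamic-object threshold but fail the reference-object one.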