Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing
Authors: Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Yang Zheng, Jingyuan Li, Eli Shlizerman
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs. |
| Researcher Affiliation | Academia | Department of Electrical & Computer Engineering, University of Washington, Seattle, USA. Department of Applied Mathematics, University of Washington, Seattle, USA. Corresponding author: EMAIL |
| Pseudocode | Yes | Algorithm 1 Track Aggregation Algorithm for Global Map Construction |
| Open Source Code | No | Code and benchmark dataset will be released for upon request from the corresponding author. We also aim to release the code and the benchmark dataset as public repository on the Github upon obtaining involved approvals. |
| Open Datasets | No | Code and benchmark dataset will be released for upon request from the corresponding author. We also aim to release the code and the benchmark dataset as public repository on the Github upon obtaining involved approvals. |
| Dataset Splits | No | The paper describes SAVVY-Bench as a benchmark for evaluation of 3D spatial reasoning of AV-LLMs. It details the task taxonomy and statistics of the QA pairs within the benchmark, such as the distribution of QA task types (Figure 2a). However, it does not explicitly provide information on train/test/validation splits of the SAVVY-Bench dataset itself for model evaluation or training purposes within the paper. |
| Hardware Specification | Yes | For baseline AV-LLMs (7B scale), we use a single A100 (40GB) to perform experiments. For Gemini 2.5, we use Google Cloud Platform's API to perform experiments. |
| Software Dependencies | No | The paper mentions several software tools and modules such as LMMs-Eval [71], CLIPSeg [63], SAM model [64], and PyQt5 for the annotation tool. However, it does not provide specific version numbers for most of these key software components, except for 'PyQt5'. |
| Experiment Setup | Yes | We use greedy decoding with temperature set to 0, and both top-p and top-k set to 1. Following [71], we sample 32 video frames uniformly across the entire video duration. For audio, we average multiple channels to produce a compressed monaural input, with a sampling rate of 16 kHz. We uniformly sample 128 frames from each video. A detection is considered valid if the average score exceeds a threshold: 0.5 for dynamic sounding objects and 0.6 for reference and facing objects. We process spatial audio signals at 0.25 s per segment, with a sampling rate of 48 kHz. The analysis is constrained to the 500–2000 Hz frequency band for speech-related audio cues. |
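
The preprocessing steps quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the segment-midpoint sampling rule, and the role labels are assumptions made for illustration. Only the numeric settings (32 or 128 frames, channel averaging, the 0.5 and 0.6 thresholds, 0.25 s segments at 48 kHz) come from the quoted text.

```python
from statistics import fmean

# Thresholds quoted in the setup; the role keys are hypothetical labels.
THRESHOLDS = {"sounding": 0.5, "reference": 0.6, "facing": 0.6}


def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over the video.

    Assumption: one frame is taken at the midpoint of each equal-length
    segment; the paper only states that sampling is uniform.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def downmix_to_mono(channels: list[list[float]]) -> list[float]:
    """Average equal-length audio channels sample by sample."""
    return [fmean(samples) for samples in zip(*channels)]


def is_valid_detection(scores: list[float], role: str) -> bool:
    """A detection counts as valid if its average score exceeds the role threshold."""
    return fmean(scores) > THRESHOLDS[role]


def segment_length_samples(sample_rate_hz: int = 48_000, segment_s: float = 0.25) -> int:
    """Number of audio samples in one spatial-audio analysis segment."""
    return round(sample_rate_hz * segment_s)


if __name__ == "__main__":
    print(uniform_frame_indices(320, 32)[:4])
    print(downmix_to_mono([[1.0, 0.0], [0.0, 1.0]]))
    print(is_valid_detection([0.4, 0.7], "sounding"))
    print(segment_length_samples())
```

With these settings, each 0.25 s segment at 48 kHz spans 12 000 samples, and the same average score of 0.55 would pass the 0.5 threshold for a sounding object but fail the 0.6 threshold for a reference or facing object.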