BAT: Learning to Reason about Spatial Sounds with Large Language Models

Authors: Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.
Researcher Affiliation Academia (1) Department of Computer Science, University of Texas at Austin, USA; (2) Department of Computer Science and Engineering, Shanghai Jiao Tong University, China.
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Our demo, dataset, code and model weights are available at: https://zhishengzheng.com/BAT.
Open Datasets Yes To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. ... We use the state-of-the-art audio simulator, SoundSpaces 2.0 (Chen et al., 2022). This platform performs on-the-fly geometry-based sound rendering, enabling realistic acoustic reverberation with arbitrary source-receiver locations. ... we sample from AudioSet (Gemmeke et al., 2017) to specify our monaural sound source A_s.
Dataset Splits No The paper mentions an 'evaluation set' which serves as the test set but does not explicitly provide details on a separate 'validation' split with sizes or percentages for hyperparameter tuning.
Hardware Specification Yes The encoder is trained on 8 RTX 3090 GPUs, with each epoch taking approximately 10 minutes. ... The training is completed on 8 V100 GPUs.
Software Dependencies No The paper mentions software components like 'AdamW' as an optimizer and 'LLaMA-Adapter V2' but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup Yes The 10-second audio sources are first loudness normalized by scaling them so that each clip has the same total sound energy. ... The resulting waveforms are binaural with 2 channels at a 32 kHz sampling rate. We use a window size of 1024, a hop size of 320, and 128 mel-bins to compute the Short-Time Fourier Transforms (STFTs) and mel-spectrograms. As a result, for a 10-second recording from AudioSet, the concatenated Mel-spectrogram and IPD feature dimension is (4, 1024, 128). ... We implement a patch masking ratio of 0.25 in both time and frequency during training... We initialize the weights of the transformer blocks using the official pretrained AudioMAE (Huang et al., 2022) checkpoint. ... we set the temperature to 0.1 and nucleus sampling (Holtzman et al., 2020) top-p to 0.75. ... We present the specific training hyperparameter configurations for SPATIAL-AST & BAT in Table 5.
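The feature geometry quoted above can be sanity-checked with a short NumPy sketch. This is a minimal illustration, not the paper's pipeline: the `normalize_energy` helper, the zero-filled placeholder features, the assumption of center-padded STFT framing, and the padding of the time axis to 1024 frames are all assumptions made here; the real model computes log-mel and IPD features from binaural STFTs.

```python
import numpy as np

SR, WIN, HOP, N_MELS = 32000, 1024, 320, 128   # settings quoted in the paper
CLIP_SECONDS, TARGET_FRAMES = 10, 1024

def normalize_energy(x, target=1.0):
    """Scale a clip so its total energy sum(x**2) equals `target`
    (a hypothetical stand-in for the paper's loudness normalization)."""
    return x * np.sqrt(target / np.sum(x ** 2))

num_samples = SR * CLIP_SECONDS          # 320000 samples per 10 s clip
num_frames = num_samples // HOP + 1      # 1001 frames, assuming center padding

# Placeholder 2-channel mel-spectrogram and 2-channel IPD features
# (zeros here; real values come from the binaural STFTs).
mel = np.zeros((2, num_frames, N_MELS))
ipd = np.zeros((2, num_frames, N_MELS))

features = np.concatenate([mel, ipd], axis=0)            # (4, 1001, 128)
pad = TARGET_FRAMES - num_frames                         # pad time axis to 1024
features = np.pad(features, ((0, 0), (0, pad), (0, 0)))
print(features.shape)  # (4, 1024, 128), matching the quoted dimension
```

The arithmetic shows why 1024 time frames is a natural target: 320000 samples at a hop of 320 yield roughly 1001 frames, which the pipeline evidently pads to the next power of two.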