Learning Spatially-Aware Language and Audio Embeddings

Authors: Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In particular, ELSA achieves +2.8% mean audio-to-text and text-to-audio R@1 above the LAION-CLAP [44] baseline, and outperforms by 11.6 mean-absolute-error in 3D source localization over the SeldNET [40] baseline on the TUT Sound Events 2018 benchmark [1]. (A sketch of these two metrics follows the table.)
Researcher Affiliation | Collaboration | Bhavika Devnani¹, Skyler Seto², Zakaria Aldeneh², Alessandro Toso², Elena Menyaylenko², Barry-John Theobald², Jonathan Sheaffer², Miguel Sarabia²; ¹Georgia Institute of Technology, ²Apple; bdevnani3@gatech.edu, {sseto, zaldeneh, atoso}@apple.com, {elenam, bjtheobald, sheaffer, miguelsdc}@apple.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, datasets, and models will be made publicly available at https://github.com/apple/ml-spatial-audio-elsa.
Open Datasets | Yes | We use AudioCaps [18], Clotho [6], and Freesound [8] as base datasets for our augmentation pipeline.
Dataset Splits | Yes | We ensure that the room augmentations do not overlap between the train, evaluation, and test datasets. We generate two different sized versions of the evaluation and test sets. The larger version consists, once more, of at least two augmentations per audio sample, whilst the smaller version has no repeated samples and, consequently, is the same size as the original test set. The smaller dataset allows reporting retrieval results on the same sized dataset as the original, as size uniformity is key to consistency in retrieval metrics. The size of the respective datasets is reported in Appendix A.1 (a split-assignment sketch follows this table):
  DATASET | SPATIAL AUDIO | SPLITS | NUM. SAMPLES | DURATION (HRS) | CAPTION DESCRIPTION
  Clotho | – | train, val, test | 3,839 | 23.99 | 5 captions per audio
  AudioCaps | – | train, val, test | 49,274 | 136.87 | 1-2 captions per audio
  FreeSound | – | train, val, test | 414,127 | 2,528.15 | 1-2 captions per audio, keyword tags
  Spatial-Clotho | Synthetic | train, val, test | 8,546 | 55.0 | 5 spatially augmented captions per audio
  Spatial-AudioCaps | Synthetic | train, val, test | 98,459 | 258.12 | 1-2 spatially augmented captions per audio
  Spatial-FreeSound | Synthetic | train, val, test | 783,033 | 4,425.53 | 1-2 spatially augmented captions per audio
  Spatial-RWD | Recorded | test | 70 | 0.25 | 1-2 human-annotated spatial captions per audio
Hardware Specification | Yes | For our best model, we train for 40 epochs on 12 nodes, each with 8 NVIDIA A100 GPUs and 96 CPU cores, with a batch size of 2,304. Training converges within 17 hours. The training dataset took 2 weeks to generate, and utilized 96 CPUs and 16 TB of disk space. Each full training run of the model takes roughly 1,600 A100 GPU-hours. This was tested on multi-node GPU machines with up to 12 nodes. The batch size was 1,024 and the model was trained for 100 epochs on a single node with 8 NVIDIA V100 GPUs and 80 CPUs. (A GPU-hour consistency check follows the table.)
Software Dependencies | No | The paper mentions various models (e.g., GPT-2, LLaMA-13B, RoBERTa-base, HTSAT) and optimizers (Adam, LAMB) but does not provide specific version numbers for the software libraries or frameworks used in the implementation.
Experiment Setup | Yes | For our best model, we train for 40 epochs on 12 nodes, each with 8 NVIDIA A100 GPUs and 96 CPU cores, with a batch size of 2,304. Training converges within 17 hours. We use the Adam optimizer with a learning rate of 5×10⁻⁵ and cosine scheduling. We select the checkpoint with the lowest mAP@10 retrieval on the spatially augmented captions. The spatial attributes branch has 485,828 parameters, and was pre-trained with a learning rate of 10⁻³ on the LAMB optimizer [47] with a weight decay factor of 0.01 and without scheduling the learning rate. The batch size was 1,024 and the model was trained for 100 epochs on a single node with 8 NVIDIA V100 GPUs and 80 CPUs. (An optimizer-configuration sketch follows the table.)
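
For the Research Type row above, the two headline numbers are a cross-modal retrieval metric (R@1) and an angular localization error (mean absolute error). Below is a minimal sketch of how such metrics are typically computed, assuming L2-normalized embedding matrices and azimuth-only direction estimates in degrees; the function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def recall_at_1(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Audio-to-text R@1: fraction of clips whose top-ranked caption
    (by cosine similarity) is the paired one. Assumes row i of each
    matrix corresponds to the same audio/caption pair, L2-normalized."""
    sims = audio_emb @ text_emb.T                 # (N, N) cosine similarities
    top1 = sims.argmax(axis=1)                    # best caption index per clip
    return float((top1 == np.arange(len(sims))).mean())

def localization_mae_deg(pred_deg: np.ndarray, true_deg: np.ndarray) -> float:
    """Mean absolute angular error in degrees, wrapping around 360 degrees."""
    diff = np.abs(pred_deg - true_deg) % 360.0
    return float(np.minimum(diff, 360.0 - diff).mean())
```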
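For the Dataset Splits row, the key constraint is that simulated room augmentations never appear in more than one split. A minimal sketch of one way to enforce that, assuming each augmentation carries a room identifier; `split_rooms` and its arguments are hypothetical and do not come from the paper's pipeline.

```python
import random

def split_rooms(room_ids, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Assign each simulated room to exactly one of train/val/test so that
    room augmentations never overlap across splits (illustrative only)."""
    rooms = sorted(set(room_ids))
    random.Random(seed).shuffle(rooms)
    n_train = int(fractions[0] * len(rooms))
    n_val = int(fractions[1] * len(rooms))
    return {
        "train": set(rooms[:n_train]),
        "val": set(rooms[n_train:n_train + n_val]),
        "test": set(rooms[n_train + n_val:]),
    }

# Every audio sample then inherits the split of the room it was rendered in,
# so no room augmentation is shared between train, val, and test.
```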
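For the Hardware Specification row, the roughly 1,600 A100 GPU-hour figure is consistent with the quoted configuration (12 nodes, 8 GPUs per node, 17 hours):

```python
nodes, gpus_per_node, hours = 12, 8, 17       # figures quoted in the row above
print(nodes * gpus_per_node * hours)          # 1632, i.e. roughly 1,600 GPU-hours
```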
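For the Experiment Setup row, here is a minimal sketch of the quoted optimizer settings, assuming a PyTorch implementation (the paper does not state its framework); the model below is a placeholder and LAMB is not part of core PyTorch.

```python
import torch

model = torch.nn.Linear(512, 512)   # placeholder standing in for the ELSA encoders
epochs = 40

# Main training: Adam, learning rate 5e-5, cosine learning-rate schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# Spatial-attributes branch pre-training: LAMB, lr 1e-3, weight decay 0.01,
# no learning-rate scheduling. A third-party LAMB implementation
# (e.g. torch_optimizer.Lamb) would be needed.
```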