Self-Supervised Generation of Spatial Audio for 360° Video

Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. During training, ground-truth spatial audio serves as self-supervision and a mixed-down mono track forms the input to our network. Using our approach, we show that it is possible to infer the spatial location of sound sources based only on 360° video and a mono audio track. Experiments conducted on both datasets show that the proposed neural network can generate plausible spatial audio for 360° video. We further validate each component of the proposed architecture and show its superiority over a state-of-the-art, but domain-independent, baseline architecture. (An illustrative input/target sketch follows the table.)
Researcher Affiliation | Collaboration | Pedro Morgado (University of California, San Diego); Nuno Vasconcelos (University of California, San Diego); Timothy Langlois (Adobe Research, Seattle); Oliver Wang (Adobe Research, Seattle)
Pseudocode | No | The paper describes the architecture and various modules but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen.
Open Datasets | Yes | We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen.
Dataset Splits | No | For our experiments, we randomly sample three partitions, each containing 75% of all videos for training and 25% for testing. (No explicit percentage or count for a separate validation set was provided; an illustrative split sketch follows the table.)
Hardware Specification | Yes | The proposed procedure can generate 1 s of spatial audio at a 48000 Hz sampling rate in 103 ms, using a single 12 GB Titan Xp GPU (3840 cores running at 1.6 GHz). (A worked timing calculation follows the table.)
Software Dependencies | No | The paper mentions software components and frameworks such as ResNet-18, FlowNet2, ImageNet, and the Adam optimizer, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | Networks are trained by back-propagation using the Adam optimizer [28] for 150k iterations (roughly two days) with parameters β1 = 0.9, β2 = 0.999 and ϵ = 1e-8, batch size of 32, learning rate of 1e-4 and weight decay of 0.0005. (An illustrative optimizer configuration follows the table.)
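To make the self-supervision setup quoted in the Research Type row concrete, the sketch below builds an (input, target) training pair from a first-order-ambisonics track. The channel ordering (ACN: W, Y, Z, X) and the use of the omnidirectional W channel as the mixed-down mono input are assumptions for illustration, not details stated in this report.

```python
import numpy as np

def make_training_pair(foa_audio: np.ndarray):
    """Build an (input, target) pair for self-supervised spatial-audio training.

    foa_audio: array of shape (4, num_samples) holding first-order ambisonic
    channels, assumed here to be in ACN order (W, Y, Z, X).
    """
    # Assumption: the omnidirectional W channel serves as the mixed-down mono input.
    mono_input = foa_audio[0:1]      # shape (1, num_samples)
    # The full ambisonic recording is the self-supervision target.
    spatial_target = foa_audio       # shape (4, num_samples)
    return mono_input, spatial_target
```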
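For the Dataset Splits row, a minimal sketch of how three random 75%/25% train/test partitions could be drawn; the function name, seeding, and rounding are illustrative, since the report only quotes the split percentages.

```python
import random

def make_partitions(video_ids, num_partitions=3, train_frac=0.75, seed=0):
    """Return `num_partitions` random (train_ids, test_ids) splits of the videos."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(num_partitions):
        ids = list(video_ids)
        rng.shuffle(ids)                           # fresh random order per partition
        cut = int(round(train_frac * len(ids)))
        partitions.append((ids[:cut], ids[cut:]))  # 75% train, 25% test
    return partitions
```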
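The Hardware Specification row implies the throughput worked out below; this is a straightforward calculation from the quoted numbers, not an additional measurement.

```python
# 1 s of 48000 Hz spatial audio generated in 103 ms (quoted figures).
synthesis_time_s = 0.103
audio_duration_s = 1.0

real_time_factor = synthesis_time_s / audio_duration_s        # ~0.103 (lower is faster)
speedup_over_real_time = audio_duration_s / synthesis_time_s  # ~9.7x real time
output_samples_per_second = 48000 * audio_duration_s / synthesis_time_s  # ~466k samples/s
```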
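The Experiment Setup row translates directly into an optimizer configuration. The sketch below uses PyTorch and a placeholder model purely for illustration; the paper's released code may use a different framework, and treating the quoted weight decay as Adam's L2 penalty is an assumption.

```python
import torch

model = torch.nn.Linear(1, 1)   # placeholder for the spatial-audio network (assumption)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,                    # learning rate of 1e-4
    betas=(0.9, 0.999),         # beta1 = 0.9, beta2 = 0.999
    eps=1e-8,                   # epsilon = 1e-8
    weight_decay=0.0005,        # weight decay of 0.0005 (assumed to be Adam's L2 penalty)
)

num_iterations = 150_000        # trained for 150k iterations (roughly two days per the paper)
batch_size = 32
```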