Self-Supervised Generation of Spatial Audio for 360° Video

Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. During training, ground-truth spatial audio serves as self-supervision and a mixed-down mono track forms the input to our network. Using our approach, we show that it is possible to infer the spatial location of sound sources based only on 360° video and a mono audio track. Experiments conducted on both datasets show that the proposed neural network can generate plausible spatial audio for 360° video. We further validate each component of the proposed architecture and show its superiority over a state-of-the-art, but domain-independent, baseline architecture. (An illustrative input/target sketch follows the table.)
Researcher Affiliation | Collaboration | Pedro Morgado (University of California, San Diego); Nuno Vasconcelos (University of California, San Diego); Timothy Langlois (Adobe Research, Seattle); Oliver Wang (Adobe Research, Seattle)
Pseudocode | No | The paper describes the architecture and various modules but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen.
Open Datasets | Yes | We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen.
Dataset Splits | No | For our experiments, we randomly sample three partitions, each containing 75% of all videos for training and 25% for testing. (No explicit percentage or count for a separate validation set was provided; an illustrative split sketch follows the table.)
Hardware Specification | Yes | The proposed procedure can generate 1 s of spatial audio at a 48000 Hz sampling rate in 103 ms, using a single 12 GB Titan Xp GPU (3840 cores running at 1.6 GHz). (A worked timing calculation follows the table.)
Software Dependencies | No | The paper mentions software components and frameworks such as ResNet-18, FlowNet2, ImageNet, and the Adam optimizer, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | Networks are trained by back-propagation using the Adam optimizer [28] for 150k iterations (roughly two days) with parameters β1 = 0.9, β2 = 0.999 and ϵ = 1e-8, batch size of 32, learning rate of 1e-4 and weight decay of 0.0005. (An illustrative optimizer configuration follows the table.)
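To make the self-supervision setup quoted in the Research Type row concrete, the sketch below builds an (input, target) training pair from a first-order-ambisonics track. The channel ordering (ACN: W, Y, Z, X) and the use of the omnidirectional W channel as the mixed-down mono input are assumptions for illustration, not details stated in this report.

```python
import numpy as np

def make_training_pair(foa_audio: np.ndarray):
    """Build an (input, target) pair for self-supervised spatial-audio training.

    foa_audio: array of shape (4, num_samples) holding first-order ambisonic
    channels, assumed here to be in ACN order (W, Y, Z, X).
    """
    # Assumption: the omnidirectional W channel serves as the mixed-down mono input.
    mono_input = foa_audio[0:1]      # shape (1, num_samples)
    # The full ambisonic recording is the self-supervision target.
    spatial_target = foa_audio       # shape (4, num_samples)
    return mono_input, spatial_target
```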
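For the Dataset Splits row, a minimal sketch of how three random 75%/25% train/test partitions could be drawn; the function name, seeding, and rounding are illustrative, since the report only quotes the split percentages.

```python
import random

def make_partitions(video_ids, num_partitions=3, train_frac=0.75, seed=0):
    """Return `num_partitions` random (train_ids, test_ids) splits of the videos."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(num_partitions):
        ids = list(video_ids)
        rng.shuffle(ids)                           # fresh random order per partition
        cut = int(round(train_frac * len(ids)))
        partitions.append((ids[:cut], ids[cut:]))  # 75% train, 25% test
    return partitions
```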
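The Hardware Specification row implies the throughput worked out below; this is a straightforward calculation from the quoted numbers, not an additional measurement.

```python
# 1 s of 48000 Hz spatial audio generated in 103 ms (quoted figures).
synthesis_time_s = 0.103
audio_duration_s = 1.0

real_time_factor = synthesis_time_s / audio_duration_s        # ~0.103 (lower is faster)
speedup_over_real_time = audio_duration_s / synthesis_time_s  # ~9.7x real time
output_samples_per_second = 48000 * audio_duration_s / synthesis_time_s  # ~466k samples/s
```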
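The Experiment Setup row translates directly into an optimizer configuration. The sketch below uses PyTorch and a placeholder model purely for illustration; the paper's released code may use a different framework, and treating the quoted weight decay as Adam's L2 penalty is an assumption.

```python
import torch

model = torch.nn.Linear(1, 1)   # placeholder for the spatial-audio network (assumption)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,                    # learning rate of 1e-4
    betas=(0.9, 0.999),         # beta1 = 0.9, beta2 = 0.999
    eps=1e-8,                   # epsilon = 1e-8
    weight_decay=0.0005,        # weight decay of 0.0005 (assumed to be Adam's L2 penalty)
)

num_iterations = 150_000        # trained for 150k iterations (roughly two days per the paper)
batch_size = 32
```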