Self-Supervised Generation of Spatial Audio for 360° Video
Authors: Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, Oliver Wang
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce several datasets, including one filmed ourselves and one collected in-the-wild from YouTube, consisting of 360° videos uploaded with spatial audio. During training, ground-truth spatial audio serves as self-supervision and a mixed-down mono track forms the input to our network. Using our approach, we show that it is possible to infer the spatial location of sound sources based only on 360° video and a mono audio track. Experiments conducted on both datasets show that the proposed neural network can generate plausible spatial audio for 360° video. We further validate each component of the proposed architecture and show its superiority over a state-of-the-art, but domain-independent, baseline architecture. (A hedged sketch of this mono-input / spatial-target self-supervision setup is given after the table.) |
| Researcher Affiliation | Collaboration | Pedro Morgado, University of California, San Diego; Nuno Vasconcelos, University of California, San Diego; Timothy Langlois, Adobe Research, Seattle; Oliver Wang, Adobe Research, Seattle |
| Pseudocode | No | The paper describes the architecture and various modules but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen. |
| Open Datasets | Yes | We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from You Tube, consisting of 360 videos uploaded with spatial audio. In the interest of reproducibility, code, data and trained models will be made available to the community at https://pedro-morgado.github.io/spatialaudiogen. |
| Dataset Splits | No | For our experiments, we randomly sample three partitions, each containing 75% of all videos for training and 25% for testing. (No explicit percentage or count is given for a separate validation set; a hedged sketch of the 75/25 split procedure is given after the table.) |
| Hardware Specification | Yes | The proposed procedure can generate 1s of spatial audio at 48000Hz sampling rate in 103ms, using a single 12GB Titan Xp GPU (3840 cores running at 1.6GHz). |
| Software Dependencies | No | The paper mentions software components and frameworks such as ResNet-18, FlowNet2, ImageNet, and the Adam optimizer, but it does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | Networks are trained by back-propagation using the Adam optimizer [28] for 150k iterations (roughly two days) with parameters β1 = 0.9, β2 = 0.999 and ϵ = 1e-8, batch size of 32, learning rate of 1e-4 and weight decay of 0.0005. (A hedged sketch of this optimizer configuration is given after the table.) |
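
As a rough illustration of the self-supervision setup quoted in the Research Type row, the following Python/NumPy sketch forms a mono network input from a first-order ambisonics (FOA) recording while keeping the full four-channel track as the training target. The use of NumPy, the four-channel ACN-ordered FOA layout, and the choice of the omnidirectional W channel as the mono mixdown are illustrative assumptions; the paper only states that a mixed-down mono track is the input and the ground-truth spatial audio is the supervision target.

```python
import numpy as np

def make_training_pair(foa_audio: np.ndarray):
    """Build a (mono input, spatial target) pair from an ambisonic clip.

    foa_audio: float array of shape (4, num_samples), assumed to be
    first-order ambisonics in ACN order (W, Y, Z, X). This layout and
    the W-channel mixdown are assumptions for illustration, not details
    taken from the paper.
    """
    # Target: the full ground-truth spatial (ambisonic) track.
    target = foa_audio
    # Input: a mono mixdown; here, simply the omnidirectional W channel,
    # which already mixes sound arriving from all directions.
    mono_input = foa_audio[0]
    return mono_input, target

# Example with a 1-second clip at 48 kHz.
clip = np.random.randn(4, 48000).astype(np.float32)
mono, spatial = make_training_pair(clip)
print(mono.shape, spatial.shape)  # (48000,) (4, 48000)
```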
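The Dataset Splits row reports three random 75%/25% train/test partitions with no separate validation set. Below is a minimal sketch of such a split procedure, assuming the videos are identified by a list of IDs and that the three partitions differ only in their random seed; both assumptions are not stated in the paper.

```python
import random

def make_partitions(video_ids, num_partitions=3, train_frac=0.75, base_seed=0):
    """Randomly sample train/test partitions of a list of video IDs."""
    partitions = []
    for i in range(num_partitions):
        rng = random.Random(base_seed + i)  # a different seed per partition
        shuffled = list(video_ids)
        rng.shuffle(shuffled)
        n_train = int(round(train_frac * len(shuffled)))
        partitions.append({
            "train": shuffled[:n_train],   # 75% of all videos
            "test": shuffled[n_train:],    # remaining 25%
        })
    return partitions

splits = make_partitions([f"video_{i:04d}" for i in range(100)])
print(len(splits[0]["train"]), len(splits[0]["test"]))  # 75 25
```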
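The Experiment Setup row fully specifies the optimization hyper-parameters. The following PyTorch sketch reproduces that configuration with a placeholder model and dummy data; applying weight decay through the optimizer and the use of PyTorch itself are assumptions, since the paper does not state how training was implemented.

```python
import torch

# Placeholder network; the actual audio-visual architecture is described in the paper.
model = torch.nn.Linear(512, 4)

# Hyper-parameters quoted from the paper: Adam with beta1=0.9, beta2=0.999,
# eps=1e-8, learning rate 1e-4, weight decay 0.0005, batch size 32, 150k iterations.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=5e-4,
)

batch_size = 32
num_iterations = 150_000

for step in range(num_iterations):
    # Dummy batch; in the paper each example pairs mono audio with a 360° video clip.
    inputs = torch.randn(batch_size, 512)
    targets = torch.randn(batch_size, 4)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
```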