Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Dan Ellis, John R. Hershey

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
Researcher Affiliation | Collaboration | 1) University of Illinois at Urbana-Champaign; 2) Google Research
Pseudocode | No | The paper describes network architectures and loss functions in detail but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | A recipe for these will be available on the project webpage: https://audioscope.github.io.
Open Datasets | Yes | In order to train on real-world audio-visual recording environments for our open-domain system, we use the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100m) (Thomee et al., 2016). ... Both audio and visual embedding networks were pre-trained on Audio Set (Gemmeke et al., 2017) for unsupervised coincidence prediction (Jansen et al., 2020) and fine-tuned on our data...
Dataset Splits | Yes | By splitting on video uploader, we select 1,600 videos for training, and use the remaining videos for validation and test. ... we obtained human annotations for 10,000 unfiltered training clips, 10,000 filtered training clips, and 10,000 filtered validation/test clips. ... We constructed an on-screen-only subset with 836 training, 735 validation, and 295 test clips, and an off-screen-only subset with 3,681 training, 836 validation, and 370 test clips. (A sketch of the uploader-disjoint split follows the table.)
Hardware Specification | Yes | All models are trained on 4 Google Cloud TPUs (16 chips) with Adam (Kingma & Ba, 2015), batch size 256, and learning rate 10^-4.
Software Dependencies | No | The paper names the Adam optimizer, but it does not specify versions for any software dependencies, libraries, or programming languages.
Experiment Setup | Yes | All models are trained on 4 Google Cloud TPUs (16 chips) with Adam (Kingma & Ba, 2015), batch size 256, and learning rate 10^-4. (A sketch of this training configuration follows the table.)
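
The Dataset Splits row describes an uploader-disjoint split: every video from a given uploader lands on the same side, and the training side stops growing at roughly 1,600 videos. The paper does not release its splitting code, so the following is a minimal Python sketch of that idea; the record layout and the uploader_id field are assumptions, not the authors' implementation.

    import random
    from collections import defaultdict

    def split_by_uploader(videos, n_train_videos=1600, seed=0):
        """Assign whole uploaders to training until ~n_train_videos are
        collected; everything else is held out for validation/test.
        Keeping uploaders disjoint avoids leaking one user's recording
        conditions across splits."""
        by_uploader = defaultdict(list)
        for video in videos:  # each video is a dict with an 'uploader_id' key (assumed schema)
            by_uploader[video["uploader_id"]].append(video)

        uploaders = sorted(by_uploader)
        random.Random(seed).shuffle(uploaders)  # deterministic shuffle of uploaders

        train, heldout = [], []
        for uploader in uploaders:
            side = train if len(train) < n_train_videos else heldout
            side.extend(by_uploader[uploader])
        return train, heldout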
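
The Hardware Specification and Experiment Setup rows give the only concrete optimization details: Adam, batch size 256, and learning rate 10^-4 on 4 Google Cloud TPUs. The paper does not name its framework, so this TensorFlow configuration is an assumption; only the three hyperparameter values come from the quoted text.

    import tensorflow as tf

    BATCH_SIZE = 256        # reported batch size
    LEARNING_RATE = 1e-4    # reported learning rate

    # Adam optimizer as cited (Kingma & Ba, 2015).
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

    # Batching would apply to whatever tf.data pipeline feeds the model, e.g.:
    #   dataset = dataset.shuffle(buffer_size).batch(BATCH_SIZE)
    # The 4-TPU (16-chip) setup would typically be wrapped in a TPUStrategy:
    #   resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=...)
    #   tf.config.experimental_connect_to_cluster(resolver)
    #   tf.tpu.experimental.initialize_tpu_system(resolver)
    #   strategy = tf.distribute.TPUStrategy(resolver)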