Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Dan Ellis, John R. Hershey
ICLR 2021 | Conference PDF | Archive PDF
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips. |
| Researcher Affiliation | Collaboration | ¹University of Illinois at Urbana-Champaign, ²Google Research |
| Pseudocode | No | The paper describes network architectures and loss functions in detail but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | A recipe for these will be available on the project webpage: https://audioscope.github.io. |
| Open Datasets | Yes | In order to train on real-world audio-visual recording environments for our open-domain system, we use the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100m) (Thomee et al., 2016). ... Both audio and visual embedding networks were pre-trained on Audio Set (Gemmeke et al., 2017) for unsupervised coincidence prediction (Jansen et al., 2020) and fine-tuned on our data... |
| Dataset Splits | Yes | By splitting on video uploader, we select 1,600 videos for training, and use the remaining videos for validation and test. ... we obtained human annotations for 10,000 unfiltered training clips, 10,000 filtered training clips, and 10,000 filtered validation/test clips. ... We constructed an on-screen-only subset with 836 training, 735 validation, and 295 test clips, and an off-screen-only subset with 3,681 training, 836 validation, and 370 test clips. |
| Hardware Specification | Yes | All models are trained on 4 Google Cloud TPUs (16 chips) with Adam (Kingma & Ba, 2015), batch size 256, and learning rate 10^-4. |
| Software Dependencies | No | The paper mentions using the Adam optimizer, but it does not name or give versions for any software dependencies, libraries, or programming languages. |
| Experiment Setup | Yes | All models are trained on 4 Google Cloud TPUs (16 chips) with Adam (Kingma & Ba, 2015), batch size 256, and learning rate 10^-4. |
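
For the "Dataset Splits" row, the paper partitions by video uploader so that no uploader contributes clips to more than one split. The sketch below shows one common way to implement such a grouped split by hashing uploader IDs; the `clips` list, its `uploader_id` field, and the fractional thresholds are illustrative assumptions rather than the authors' code (the paper instead fixes 1,600 training videos and assigns the remaining videos to validation and test).

```python
import hashlib

def split_by_uploader(clips, train_frac=0.8, valid_frac=0.1):
    """Assign each clip to train/valid/test from a hash of its uploader ID,
    so all clips from the same uploader land in the same partition."""
    splits = {"train": [], "valid": [], "test": []}
    for clip in clips:
        # Stable value in [0, 1) derived from the uploader ID string.
        digest = hashlib.md5(clip["uploader_id"].encode("utf-8")).hexdigest()
        u = int(digest, 16) / 16**32
        if u < train_frac:
            splits["train"].append(clip)
        elif u < train_frac + valid_frac:
            splits["valid"].append(clip)
        else:
            splits["test"].append(clip)
    return splits
```

Hashing the grouping key, rather than shuffling individual clips, keeps the assignment deterministic and prevents uploader-level leakage between training and evaluation, which is the point of the paper's split-by-uploader protocol.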
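The "Experiment Setup" row quotes the core hyperparameters (Adam, batch size 256, learning rate 1e-4). Below is a minimal TensorFlow sketch of that configuration, assuming TensorFlow only because the paper does not name a framework; the dummy audio tensors are placeholders, and the authors' 4-TPU distribution setup and AudioScope model are not reproduced here.

```python
import tensorflow as tf

# Hyperparameters quoted in the paper: Adam optimizer, batch size 256, learning rate 1e-4.
BATCH_SIZE = 256
LEARNING_RATE = 1e-4

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Placeholder stand-in for the audio-visual clip pipeline; shape is illustrative only.
dummy_audio = tf.zeros([256, 16000])
dataset = (tf.data.Dataset.from_tensor_slices(dummy_audio)
           .shuffle(256)
           .batch(BATCH_SIZE, drop_remainder=True))
```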