Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Dan Ellis, John R. Hershey

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
Researcher Affiliation | Collaboration | 1) University of Illinois at Urbana-Champaign; 2) Google Research
Pseudocode | No | The paper describes network architectures and loss functions in detail but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | A recipe for these will be available on the project webpage: https://audioscope.github.io.
Open Datasets | Yes | In order to train on real-world audio-visual recording environments for our open-domain system, we use the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100m) (Thomee et al., 2016). ... Both audio and visual embedding networks were pre-trained on Audio Set (Gemmeke et al., 2017) for unsupervised coincidence prediction (Jansen et al., 2020) and fine-tuned on our data...
Dataset Splits | Yes | By splitting on video uploader, we select 1,600 videos for training, and use the remaining videos for validation and test. ... we obtained human annotations for 10,000 unfiltered training clips, 10,000 filtered training clips, and 10,000 filtered validation/test clips. ... We constructed an on-screen-only subset with 836 training, 735 validation, and 295 test clips, and an off-screen-only subset with 3,681 training, 836 validation, and 370 test clips. (A sketch of the uploader-disjoint split follows the table.)
Hardware Specification | Yes | All models are trained on 4 Google Cloud TPUs (16 chips) with Adam (Kingma & Ba, 2015), batch size 256, and learning rate 10^-4.
Software Dependencies | No | The paper names the Adam optimizer, but it does not specify versions for any software dependencies, libraries, or programming languages.
Experiment Setup | Yes | All models are trained on 4 Google Cloud TPUs (16 chips) with Adam (Kingma & Ba, 2015), batch size 256, and learning rate 10^-4. (A sketch of this training configuration follows the table.)
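
The Dataset Splits row describes an uploader-disjoint split: every video from a given uploader lands on the same side, and the training side stops growing at roughly 1,600 videos. The paper does not release its splitting code, so the following is a minimal Python sketch of that idea; the record layout and the uploader_id field are assumptions, not the authors' implementation.

    import random
    from collections import defaultdict

    def split_by_uploader(videos, n_train_videos=1600, seed=0):
        """Assign whole uploaders to training until ~n_train_videos are
        collected; everything else is held out for validation/test.
        Keeping uploaders disjoint avoids leaking one user's recording
        conditions across splits."""
        by_uploader = defaultdict(list)
        for video in videos:  # each video is a dict with an 'uploader_id' key (assumed schema)
            by_uploader[video["uploader_id"]].append(video)

        uploaders = sorted(by_uploader)
        random.Random(seed).shuffle(uploaders)  # deterministic shuffle of uploaders

        train, heldout = [], []
        for uploader in uploaders:
            side = train if len(train) < n_train_videos else heldout
            side.extend(by_uploader[uploader])
        return train, heldout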
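
The Hardware Specification and Experiment Setup rows give the only concrete optimization details: Adam, batch size 256, and learning rate 10^-4 on 4 Google Cloud TPUs. The paper does not name its framework, so this TensorFlow configuration is an assumption; only the three hyperparameter values come from the quoted text.

    import tensorflow as tf

    BATCH_SIZE = 256        # reported batch size
    LEARNING_RATE = 1e-4    # reported learning rate

    # Adam optimizer as cited (Kingma & Ba, 2015).
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

    # Batching would apply to whatever tf.data pipeline feeds the model, e.g.:
    #   dataset = dataset.shuffle(buffer_size).batch(BATCH_SIZE)
    # The 4-TPU (16-chip) setup would typically be wrapped in a TPUStrategy:
    #   resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=...)
    #   tf.config.experimental_connect_to_cluster(resolver)
    #   tf.tpu.experimental.initialize_tpu_system(resolver)
    #   strategy = tf.distribute.TPUStrategy(resolver)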