Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
Authors: Haytham M. Fayek, Anurag Kumar
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the large-scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%). See the mAP sketch after the table. |
| Researcher Affiliation | Industry | Facebook Reality Labs, Redmond, WA, USA {haythamfayek, anuragkr}@fb.com |
| Pseudocode | No | The paper describes model architectures and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository for the methodology described. |
| Open Datasets | Yes | AudioSet [Gemmeke et al., 2017] is the largest dataset for sound events. The dataset provides YouTube videos for 527 sound events. |
| Dataset Splits | Yes | The training set comprises approximately 2 million videos, whereas the evaluation set comprises approximately 20,000 videos. We sample approximately 25,000 videos from the training set to use as the validation set. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU, CPU models, or cloud resources) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The network in the video model is trained for 20 epochs using the Adam optimizer [Kingma and Ba, 2014] with a mini-batch size of 144. In the fusion experiments, all neural networks are trained using Adam for 100 epochs. The mini-batch size is set to 256. n_a, n_v, and n_av are all single-layer networks with 512 units and sigmoid activations. See the training sketch after the table. |
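
For context on the headline metric: mAP on AudioSet is the mean, over the 527 classes, of the per-class Average Precision on the evaluation clips. Below is a minimal sketch of that computation; the use of scikit-learn's `average_precision_score` and the macro-averaging details are assumptions for illustration, not the paper's published evaluation code.

```python
# Hedged sketch: mean Average Precision (mAP) over AudioSet's 527 classes.
# sklearn is an assumed dependency; the paper does not release its eval code.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """Macro-average AP over classes, skipping classes with no positives.

    y_true:  (num_clips, num_classes) binary ground-truth matrix.
    y_score: (num_clips, num_classes) predicted probabilities.
    """
    aps = []
    for c in range(y_true.shape[1]):
        if y_true[:, c].sum() > 0:  # AP is undefined with no positive examples
            aps.append(average_precision_score(y_true[:, c], y_score[:, c]))
    return float(np.mean(aps))

# Toy example with random scores over 527 classes (AudioSet's label set size).
rng = np.random.default_rng(0)
y_true = (rng.random((1000, 527)) < 0.05).astype(int)
y_score = rng.random((1000, 527))
print(mean_average_precision(y_true, y_score))
```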
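
The Experiment Setup row pins down the fusion configuration: n_a, n_v, and n_av are single-layer networks with 512 units and sigmoid activations, trained with Adam for 100 epochs at a mini-batch size of 256. The sketch below translates that description into PyTorch under stated assumptions: the input embedding sizes (AUDIO_DIM, VIDEO_DIM), the concatenation-based wiring of the three networks, and the BCE loss are illustrative choices not specified in the quoted text.

```python
# Hedged sketch of the fusion setup described in the paper. Feature dimensions
# and the exact fusion wiring are assumptions; the paper's own architecture
# details take precedence.
import torch
import torch.nn as nn

NUM_CLASSES = 527                  # AudioSet sound-event classes
AUDIO_DIM, VIDEO_DIM = 1024, 1024  # assumed embedding sizes

class FusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Single-layer, 512-unit, sigmoid-activated networks, per the paper.
        self.n_a = nn.Sequential(nn.Linear(AUDIO_DIM, 512), nn.Sigmoid())
        self.n_v = nn.Sequential(nn.Linear(VIDEO_DIM, 512), nn.Sigmoid())
        self.n_av = nn.Sequential(nn.Linear(512 + 512, 512), nn.Sigmoid())
        self.classifier = nn.Linear(512, NUM_CLASSES)

    def forward(self, audio_feat, video_feat):
        h = self.n_av(torch.cat([self.n_a(audio_feat),
                                 self.n_v(video_feat)], dim=-1))
        return self.classifier(h)  # logits; pair with BCEWithLogitsLoss

model = FusionHead()
optimizer = torch.optim.Adam(model.parameters())  # Adam per the paper; default LR assumed
criterion = nn.BCEWithLogitsLoss()                # multi-label targets over 527 classes

# Training loop shape (data loader omitted):
# for epoch in range(100):                   # fusion schedule from the paper
#     for audio, video, labels in loader:    # mini-batches of 256
#         optimizer.zero_grad()
#         loss = criterion(model(audio, video), labels)
#         loss.backward()
#         optimizer.step()
```

The sigmoid output layer and BCE loss reflect that AudioSet is a multi-label problem, consistent with the weakly labeled setting the paper addresses.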