Audio-Visual Localization by Synthetic Acoustic Image Generation

Authors: Valentina Sanguineti, Pietro Morerio, Alessio Del Bue, Vittorio Murino

AAAI 2021, pp. 2523-2531

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assess the quality of the generated synthetic acoustic images on the task of unsupervised sound source localization, both qualitatively and quantitatively, while also considering standard generation metrics. The model is evaluated on multimodal datasets containing acoustic images (used for training) and on unseen datasets containing only monaural audio signals and RGB frames, reaching more accurate localization than the state of the art. A set of experiments evaluates the quality of the reconstructed spatialized audio in terms of classification and localization performance. (A generic sketch of similarity-based localization appears after the table.)
Researcher Affiliation | Collaboration | Valentina Sanguineti (1,2), Pietro Morerio (1), Alessio Del Bue (3), Vittorio Murino (1,4,5). 1: Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia, Genoa, Italy; 2: University of Genova, Genoa, Italy; 3: Visual Geometry and Modelling, Istituto Italiano di Tecnologia, Genoa, Italy; 4: University of Verona, Verona, Italy; 5: Huawei Technologies Ltd., Ireland Research Center, Dublin, Ireland
Pseudocode | No | The paper describes the architecture and method in prose and diagrams but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/IIT-PAVIS/Acoustic-Image-Generation
Open Datasets | Yes | ACIVW (Sanguineti et al. 2020) is a multimodal dataset including acoustic images, containing 5 hours of videos acquired in the wild... AVIA (Pérez et al. 2020) is a multimodal dataset including acoustic images with 14 different actions... A random subset of Flickr-SoundNet (Aytar, Vondrick, and Torralba 2016)... VGGSound (Chen et al. 2020) is a dataset with over 200k 10-second video clips...
Dataset Splits | No | The paper refers to training and test sets (e.g., "In Table 1 we evaluate reconstruction of acoustic images for both the test sets of ACIVW and AVIA datasets"), but does not report a separate validation set or specific split ratios. (A placeholder splitting sketch appears after the table.)
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions general components such as ResNet50, a VAE, and a U-Net, but does not specify versions for any programming languages, libraries, or other software dependencies needed to reproduce the experiments.
Experiment Setup | No | The paper mentions aspects of the training strategy, such as training only the last ResNet50 layer and the training time interval, and notes that the weight of the VAE latent loss was tuned, but it does not provide concrete hyperparameters such as learning rate, batch size, optimizer settings, or the final value of the tuned latent loss weight. (A hedged training-setup sketch appears after the table.)
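
The localization task assessed above can be made concrete with a generic similarity-based sketch. This is not the authors' pipeline: the function name, the embedding dimension, and the cosine-similarity heatmap are all assumptions, shown only to illustrate the idea of comparing an audio embedding against each spatial position of a visual feature map and taking the peak as the predicted source location.

```python
import numpy as np

def localization_map(audio_embedding, visual_features):
    """Cosine similarity between a pooled audio embedding (C,) and each
    spatial position of a visual feature map (C, H, W) -> heatmap (H, W)."""
    C, H, W = visual_features.shape
    feats = visual_features.reshape(C, -1)                          # (C, H*W)
    feats = feats / (np.linalg.norm(feats, axis=0, keepdims=True) + 1e-8)
    a = audio_embedding / (np.linalg.norm(audio_embedding) + 1e-8)
    return (a @ feats).reshape(H, W)

# Hypothetical usage: 512-dim embeddings over a 14x14 feature grid.
rng = np.random.default_rng(0)
heatmap = localization_map(rng.standard_normal(512),
                           rng.standard_normal((512, 14, 14)))
peak = np.unravel_index(heatmap.argmax(), heatmap.shape)  # predicted source
```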
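Since the paper reports no validation split, a reproduction must choose one. The sketch below is a minimal placeholder assuming a shuffled index split; the 10%/10% fractions and the function name are invented for illustration, not taken from the paper.

```python
import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sample indices and carve out validation/test subsets.
    The fractions are placeholders; the paper reports no split ratios."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    train = idx[n_val + n_test:]
    val, test = idx[:n_val], idx[n_val:n_val + n_test]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
```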
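The two training details the paper does state (only the last ResNet50 layer is trained; the VAE latent-loss weight is tuned but its final value is unreported) can be sketched in PyTorch. The learning rate, optimizer choice, beta value, and the interpretation of "last layer" as the final block plus head are all placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

# Freeze every ResNet50 parameter except the final block and head,
# one plausible reading of "training only the last ResNet50 layer".
backbone = resnet50(weights="IMAGENET1K_V1")
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith(("layer4", "fc"))

# lr is a guess; the paper gives no optimizer settings.
optimizer = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4)

def vae_loss(recon, target, mu, logvar, beta=1.0):
    """Reconstruction term plus beta-weighted KL divergence; the paper
    tunes this weight but does not report it, so beta=1.0 is a placeholder."""
    rec = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```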