Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Authors: Yan-Bo Lin, Yu-Chiang Frank Wang

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios, with ablation studies and visualizations further supporting the use of our model for audio spatialization. |
| Researcher Affiliation | Collaboration | Yan-Bo Lin¹ and Yu-Chiang Frank Wang¹,² (¹Graduate Institute of Communication Engineering, National Taiwan University, Taiwan; ²ASUS Intelligent Cloud Services, Taiwan) |
| Pseudocode | No | The paper describes its method using mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not mention providing open-source code for its methodology. |
| Open Datasets | Yes | FAIR-PLAY (Gao and Grauman 2019a): 1,871 10 s video clips with binaural recordings. REC-STREET, YT-CLEAN, and YT-MUSIC (Morgado et al. 2018). |
| Dataset Splits | Yes | For the train/val/test split, we follow the splits provided with the FAIR-PLAY dataset. |
| Hardware Specification | Yes | We implement our model using PyTorch (Paszke et al. 2019) and train it on a single NVIDIA GTX 1080 Ti GPU with 12 GB memory. |
| Software Dependencies | No | The paper mentions PyTorch (Paszke et al. 2019) but does not specify version numbers for PyTorch or other software components. |
| Experiment Setup | Yes | The raw audio is resampled at 16 kHz. For the STFT, we use a Hann window of length 25 ms, an FFT size of 512, and a hop length of 10 ms. During training, we randomly sample one 0.63 s audio segment per video together with the corresponding video frame; at test time, we sample all audio segments in a video with a 0.05 s hop size. (These settings are illustrated in the code sketch below.) |
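
The audio settings quoted in the Experiment Setup row translate directly into STFT parameters at a 16 kHz sample rate: a 25 ms Hann window is 400 samples, a 10 ms hop is 160 samples, and a 0.63 s training segment is about 10,080 samples. The following is a minimal sketch of that preprocessing under the assumption that torchaudio is used; the file path, function names, and random-crop logic are illustrative and not taken from the paper.

```python
# Minimal sketch of the audio preprocessing described in the Experiment Setup row.
# Only the numeric settings (16 kHz sample rate, 25 ms Hann window, FFT size 512,
# 10 ms hop, 0.63 s training segments) come from the paper; the use of torchaudio,
# the function names, and the file path are illustrative assumptions.
import torch
import torchaudio

SAMPLE_RATE = 16_000                      # raw audio resampled to 16 kHz
WIN_LENGTH = round(0.025 * SAMPLE_RATE)   # 25 ms Hann window -> 400 samples
HOP_LENGTH = round(0.010 * SAMPLE_RATE)   # 10 ms hop -> 160 samples
N_FFT = 512
SEGMENT_LEN = round(0.63 * SAMPLE_RATE)   # 0.63 s training segment -> 10,080 samples


def load_and_resample(path: str) -> torch.Tensor:
    """Load a (possibly binaural) clip and resample it to 16 kHz."""
    waveform, sr = torchaudio.load(path)  # shape: (channels, time)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    return waveform


def stft_spectrogram(segment: torch.Tensor) -> torch.Tensor:
    """Complex STFT with the window/hop/FFT settings quoted above."""
    return torch.stft(
        segment,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        return_complex=True,
    )


# Example: one random 0.63 s training crop from a clip (path is a placeholder).
waveform = load_and_resample("FAIR-Play/binaural_audios/000001.wav")
start = torch.randint(0, waveform.shape[-1] - SEGMENT_LEN, (1,)).item()
spec = stft_spectrogram(waveform[:, start:start + SEGMENT_LEN])
```

At test time, the same STFT would be applied to consecutive 0.63 s segments spaced 0.05 s (800 samples) apart rather than to a single random crop.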