Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation
Authors: Yan-Bo Lin, Yu-Chiang Frank Wang (pp. 2056-2063)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios, with ablation studies and visualizations further supporting the use of our model for audio spatialization. |
| Researcher Affiliation | Collaboration | Yan-Bo Lin (1) and Yu-Chiang Frank Wang (1, 2). (1) Graduate Inst. Communication Engineering, National Taiwan University, Taiwan; (2) ASUS Intelligent Cloud Services, Taiwan |
| Pseudocode | No | The paper describes its method using mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not mention providing open-source code for its methodology. |
| Open Datasets | Yes | FAIR-PLAY (Gao and Grauman 2019a). The FAIR-PLAY dataset consists of 1,871 10s clips of videos with binaural recording. REC-STREET (Pedro Morgado and Wang 2018). YT-CLEAN (Pedro Morgado and Wang 2018). YT-MUSIC (Pedro Morgado and Wang 2018). |
| Dataset Splits | Yes | As for the train/val/test split, we follow the splits provided with the FAIR-PLAY dataset. |
| Hardware Specification | Yes | We implement our model using PyTorch (Paszke et al. 2019) and train our model on a single NVIDIA GTX 1080 Ti GPU with 12 GB memory. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al. 2019)', but does not specify a version number for PyTorch or other software components used. |
| Experiment Setup | Yes | As for audio settings in our experiments, the raw audio data are resampled at 16 kHz. As for the STFT setting, we use a Hann window of length 25 ms, an FFT size of 512, and a hop length of 10 ms. During training, we randomly sample one 0.63 s audio segment from a video together with the corresponding video frame. As for testing, we sample all the audio segments in a video with a 0.05 s hop size. |
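
The audio settings in the Experiment Setup row are given in seconds, and it can help to see what they imply in samples. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the frame count assumes a centered STFT (one frame per hop plus one), a convention the paper does not state.

```python
# Values reported in the Experiment Setup row.
SAMPLE_RATE = 16_000   # Hz
WIN_SEC = 0.025        # Hann window length
HOP_SEC = 0.010        # STFT hop length
N_FFT = 512            # FFT size
SEGMENT_SEC = 0.63     # audio segment length
TEST_HOP_SEC = 0.05    # hop between segments at test time


def to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sample_rate))


def stft_shape(num_samples, n_fft, hop):
    """Return (frequency bins, frames) for a one-sided, centered STFT.

    Assumption: the signal is padded so that frames = num_samples // hop + 1.
    """
    return n_fft // 2 + 1, num_samples // hop + 1


def sliding_segment_starts(total_samples, segment, hop):
    """Start offsets of every full-length segment in a sliding window."""
    if total_samples < segment:
        return []
    return list(range(0, total_samples - segment + 1, hop))


if __name__ == "__main__":
    win = to_samples(WIN_SEC)
    hop = to_samples(HOP_SEC)
    segment = to_samples(SEGMENT_SEC)
    print("window / hop in samples:", win, hop)
    print("segment length in samples:", segment)
    print("spectrogram shape:", stft_shape(segment, N_FFT, hop))

    # A 10 s FAIR-PLAY clip, sliced at test time with a 0.05 s hop.
    clip = to_samples(10.0)
    starts = sliding_segment_starts(clip, segment, to_samples(TEST_HOP_SEC))
    print("test segments per 10 s clip:", len(starts))
```

Under these assumptions, a 0.63 s segment is 10,080 samples and yields a 257 x 64 spectrogram, and a 10 s clip produces 188 overlapping test segments. A different padding convention would change the frame count.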