Learning Representations from Audio-Visual Spatial Alignment
Authors: Pedro Morgado, Yi Li, Nuno Vasconcelos
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition and video semantic segmentation. Dataset and code are available at https://github.com/pedro-morgado/AVSpatialAlignment. |
| Researcher Affiliation | Academia | Pedro Morgado, Yi Li, Nuno Vasconcelos; Department of Electrical and Computer Engineering, University of California, San Diego; {pmaravil,yil898,nuno}@eng.ucsd.edu |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset and code are available at https://github.com/pedro-morgado/AVSpatialAlignment. |
| Open Datasets | Yes | Dataset and code are available at https://github.com/pedro-morgado/AVSpatialAlignment. We collected a dataset of 360° video with spatial audio from YouTube, containing clips from a diverse set of topics such as musical performances, vlogs, sports, and others. This diversity is critical to learn good representations. Similarly to prior work [40], search results were cleaned by removing videos that 1) did not contain valid ambisonics, 2) only contain still images, or 3) contain a significant amount of post-production sounds such as voice-overs and background music. The resulting dataset, denoted YouTube-360 (YT-360), contains a total of 5,506 videos, which was split into 4,506 videos for training and 1,000 for testing. |
| Dataset Splits | No | The paper reports a train/test split for the YT-360 dataset ('4,506 videos for training and 1,000 for testing') but does not specify a separate validation split within this dataset. While the representations are fine-tuned on UCF and HMDB, which have standard splits, the question asks for the details needed to reproduce *their* experiment, which primarily uses YT-360 without a described validation set. |
| Hardware Specification | No | The paper mentions 'distributed over 2 GPUs' and the use of 'the Nautilus platform', but does not provide specific GPU models (e.g., NVIDIA A100, Tesla V100), CPU types, or detailed specifications for the Nautilus platform. |
| Software Dependencies | No | The paper describes various data processing steps and training configurations, but it does not specify any software names with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | **Video pre-processing:** We sampled K = 4 crops per video at different viewing angles. Since up and down viewing directions are often less informative, we restrict the center of each crop to latitudes φ ∈ [−60°, 60°]. We also ensure that viewing angles are sampled at least 36° apart. Normal field-of-view (NFOV) crops are extracted using a Gnomonic projection with random angular coverage between 25° and 90° wide for data augmentation. If naive equirectangular crops were taken, the distortion patterns of these crops at latitudes outside the horizon line could potentially reveal the vertical position of the crop, allowing the network to cheat the AVSA task. Following NFOV projection, video clips are resized to 112×112 resolution. Random horizontal flipping, color jittering and Z-normalization are applied. Each video clip is 0.5 s long and is extracted at 16 fps. **Audio pre-processing:** First-order ambisonics (FOA) are used for spatial audio. Audio clips for the different viewing angles are generated by simply rotating the ambisonics [31]. One second of audio is extracted at 24 kHz, and four channels (FOA) of normalized log mel-spectrograms are used as the input to the audio encoder. Spectrograms are computed using an STFT with a window size of 21 ms and a hop size of 10 ms. The extracted frequency components are aggregated on a mel scale with 128 levels. **Architecture and optimization:** The video encoder fv is the 18-layer R(2+1)D model [56], and the audio encoder fa is a 9-layer 2D convolutional neural network operating on the time-frequency domain. The translation networks, gv2a and ga2v, are instantiated with depth D = 2. Training is conducted using the Adam optimizer [28] with a batch size of 28 distributed over 2 GPUs, a learning rate of 1e-4, weight decay of 1e-5 and default momentum parameters (β1, β2) = (0.9, 0.999). Both curriculum learning phases are trained for 50 epochs. To control for the number of iterations, models trained only on the first or second phase are trained for 100 epochs. (Illustrative sketches of the viewing-angle sampling, ambisonics rotation, spectrogram extraction and optimizer setup follow the table.) |
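
The viewing-angle sampling described in the video pre-processing step can be illustrated with a short sketch. This is not the authors' code; the function name and the rejection-sampling strategy are assumptions, but the sketch respects the stated constraints (K = 4 crops, crop centers at latitudes in [−60°, 60°], pairwise angular separation of at least 36°).

```python
import numpy as np

def sample_viewing_angles(k=4, min_sep_deg=36.0, max_pitch_deg=60.0, seed=None):
    """Rejection-sample k viewing directions (yaw, pitch) in degrees with
    pitch restricted to [-max_pitch_deg, max_pitch_deg] and pairwise
    angular separation of at least min_sep_deg (illustrative sketch)."""
    rng = np.random.default_rng(seed)

    def to_unit(yaw, pitch):
        # Convert (yaw, pitch) in degrees to a unit direction vector.
        yaw, pitch = np.radians(yaw), np.radians(pitch)
        return np.array([np.cos(pitch) * np.cos(yaw),
                         np.cos(pitch) * np.sin(yaw),
                         np.sin(pitch)])

    angles = []
    while len(angles) < k:
        yaw = rng.uniform(-180.0, 180.0)
        pitch = rng.uniform(-max_pitch_deg, max_pitch_deg)
        v = to_unit(yaw, pitch)
        seps = [np.degrees(np.arccos(np.clip(v @ to_unit(y, p), -1.0, 1.0)))
                for y, p in angles]
        if all(s >= min_sep_deg for s in seps):
            angles.append((yaw, pitch))
    return angles
```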
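The paper generates audio for each viewpoint by rotating the first-order ambisonics [31]. A minimal sketch of such a rotation is below; it assumes ACN channel ordering (W, Y, Z, X) and exploits the fact that first-order directional channels transform like a Cartesian vector under rotation. The sign conventions and channel ordering depend on the ambisonics format, so treat this as illustrative rather than the authors' implementation.

```python
import numpy as np

def rotate_foa(foa, yaw_deg, pitch_deg=0.0):
    """Rotate a first-order ambisonics signal of shape (4, T), assumed to be
    in ACN order (W, Y, Z, X), so the chosen viewing direction faces forward.
    Illustrative sketch: sign/ordering conventions vary across formats."""
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    # Rotation about the vertical axis (yaw) ...
    Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0,          0.0,         1.0]])
    # ... followed by rotation about the lateral axis (pitch).
    Ry = np.array([[ np.cos(pitch), 0.0, np.sin(pitch)],
                   [ 0.0,           1.0, 0.0],
                   [-np.sin(pitch), 0.0, np.cos(pitch)]])
    R = Ry @ Rz
    w, y, z, x = foa
    xyz_rot = R @ np.stack([x, y, z])   # rotate the directional channels
    x_r, y_r, z_r = xyz_rot
    return np.stack([w, y_r, z_r, x_r])
```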
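The audio features (four-channel log mel-spectrograms from 24 kHz audio, ~21 ms window, 10 ms hop, 128 mel bands) can be approximated with torchaudio. The FFT size and the Z-normalization details below are assumptions; the paper only reports the window, hop and mel-bin counts.

```python
import torch
import torchaudio

SR = 24000
WIN = int(0.021 * SR)   # ~21 ms window (504 samples)
HOP = int(0.010 * SR)   # 10 ms hop (240 samples)

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=512, win_length=WIN, hop_length=HOP, n_mels=128)

def audio_to_logmel(foa_waveform):
    """foa_waveform: (4, T) first-order ambisonics at 24 kHz.
    Returns per-channel Z-normalized log mel-spectrograms, (4, 128, frames)."""
    spec = torch.log(mel(foa_waveform) + 1e-6)
    mean = spec.mean(dim=(1, 2), keepdim=True)
    std = spec.std(dim=(1, 2), keepdim=True)
    return (spec - mean) / (std + 1e-6)
```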
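Finally, the reported optimization setup maps onto a standard PyTorch Adam configuration. The snippet is a sketch under those hyperparameters: the encoders are placeholder modules (the paper uses an 18-layer R(2+1)D video encoder and a 9-layer 2D CNN audio encoder), and the 2-GPU data distribution is omitted.

```python
import itertools
import torch

# Placeholder modules standing in for the video and audio encoders.
video_encoder = torch.nn.Linear(512, 512)
audio_encoder = torch.nn.Linear(512, 512)

params = itertools.chain(video_encoder.parameters(), audio_encoder.parameters())
optimizer = torch.optim.Adam(
    params,
    lr=1e-4,             # learning rate reported in the paper
    weight_decay=1e-5,   # weight decay reported in the paper
    betas=(0.9, 0.999),  # default Adam momentum parameters
)
# Reported schedule: batch size 28 split over 2 GPUs; 50 epochs per
# curriculum phase (100 epochs when training a single phase only).
```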