Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Authors: Pritam Sarkar, Ali Etemad

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use 3 datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or achieves performance on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. (A minimal sketch of this relaxed audio-visual sampling appears after the table.)
Researcher Affiliation | Collaboration | Pritam Sarkar (1, 2), Ali Etemad (1); (1) Queen's University, Canada; (2) Vector Institute; {pritam.sarkar, ali.etemad}@queensu.ca
Pseudocode | Yes | We present the pseudocode in Appendix A.
Open Source Code | Yes | The codes, pretrained models, and supplementary material are available on the project website: https://pritamqu.github.io/CrissCross
Open Datasets | Yes | We use 3 datasets of different sizes: Kinetics-Sound (Arandjelovic and Zisserman 2017), Kinetics400 (Kay et al. 2017), and AudioSet (Gemmeke et al. 2017), to pretrain CrissCross. We evaluate CrissCross on different downstream tasks, namely action recognition, sound classification, and action retrieval. We use 2 popular benchmarks, UCF101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011), to perform action recognition and retrieval, while ESC50 (Piczak 2015) and DCASE (Stowell et al. 2015) are used for sound classification.
Dataset Splits | No | We tune the model using split-1 of both datasets and report the top-1 accuracy averaged over all the splits. This implies that standard splits are used, but the paper does not provide concrete details such as split percentages, explicit train/validation/test counts, or a splitting methodology sufficient for reproduction. (A minimal sketch of the stated protocol appears after the table.)
Hardware Specification | Yes | Table 6 lists the configurations CrissCross 4 GPUs R(2+1)D-18 and CrissCross 8 GPUs R(2+1)D-18, indicating pretraining on 4 and 8 GPUs.
Software Dependencies | No | The paper mentions the 'Adam (Kingma and Ba 2015) optimizer', a 'cosine learning rate scheduler (Loshchilov and Hutter 2017)', and architectural backbones such as 'R(2+1)D' and 'ResNet', but does not provide version numbers for any software, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | We use the Adam (Kingma and Ba 2015) optimizer with a cosine learning rate scheduler (Loshchilov and Hutter 2017) to pretrain the encoders and use a fixed learning rate to train the predictors. [...] we find that a higher predictor learning rate helps the network learn better representations. In particular, setting the predictor learning rate to be the same as the base learning rate results in unstable training, and the loss curve shows oscillating behavior. We empirically find that setting the predictor learning rate to 10 times the base learning rate works well. [...] we downsample the visual streams to 16 frames per second and feed 8 frames of resolution 112 × 112 to the visual encoder. Next, we downsample the audio signals to 16 kHz and segment them into 2-second segments. We transform the segmented raw audio waveforms to mel-spectrograms using 80 mel filters; we set the hop size to 10 milliseconds and the FFT window length to 1024. Finally, we feed spectrograms of shape 80 × 200 to the audio encoder. We perform linear evaluations using 8 frames of visual input and 2 seconds of audio input. [...] For a fair comparison to earlier works, we adopt 2 setups for finetuning, one with 8 frames and the other with 32 frames. In both setups, we use a spatial resolution of 224 × 224. (Minimal sketches of the audio preprocessing and the optimizer configuration appear after the table.)
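
The relaxed cross-modal synchronicity referred to in the Research Type row can be illustrated with a minimal sketch: instead of forcing the audio and visual clips to start at the same timestamp, the audio segment is sampled from a possibly different part of the same video. The clip lengths follow the experiment setup (8 frames at 16 fps, roughly 0.5 s of video, and 2 s of audio); the function name and the uniform sampling are illustrative assumptions, not the authors' implementation.

    import random

    def sample_relaxed_offsets(video_duration_s, visual_len_s=0.5, audio_len_s=2.0):
        # Visual clip: 8 frames at 16 fps (~0.5 s); audio clip: 2 s (see Experiment Setup).
        visual_start = random.uniform(0.0, video_duration_s - visual_len_s)
        # Relaxed synchronicity: the audio start is drawn independently of the visual
        # start, so the two clips need not be temporally aligned within the video.
        audio_start = random.uniform(0.0, video_duration_s - audio_len_s)
        return visual_start, audio_start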
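
The evaluation protocol quoted under Dataset Splits (tune on split-1, report top-1 accuracy averaged over all official splits of UCF101 and HMDB51) can be sketched as follows; train_and_evaluate is a hypothetical helper, and the exact split composition remains unspecified in the paper.

    def averaged_top1(splits, hparams_from_split1, train_and_evaluate):
        # train_and_evaluate(split, hparams) -> top-1 accuracy on that split's test set.
        accuracies = [train_and_evaluate(split, hparams_from_split1) for split in splits]
        return sum(accuracies) / len(accuracies)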
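
The audio preprocessing described under Experiment Setup (2-second clips at 16 kHz, 80 mel filters, 10 ms hop, FFT window of 1024, yielding spectrograms of roughly 80 × 200) can be reproduced approximately as below; torchaudio is an assumed choice of library, as the paper does not name its audio toolkit.

    import torch
    import torchaudio

    waveform = torch.randn(1, 32000)  # dummy 2-second clip at 16 kHz
    to_mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000,
        n_fft=1024,      # FFT window length
        hop_length=160,  # 10 ms hop at 16 kHz
        n_mels=80,       # 80 mel filters
    )
    spectrogram = to_mel(waveform)  # shape (1, 80, 201), i.e. roughly 80 x 200

The optimizer configuration (Adam with a cosine schedule for the encoders and a fixed predictor learning rate of 10 times the base learning rate) can be expressed with per-parameter-group schedules. The base learning rate, epoch count, and placeholder modules below are assumptions for illustration only.

    import math
    from torch import nn, optim

    encoder = nn.Linear(512, 128)    # stand-in for the visual/audio encoders
    predictor = nn.Linear(128, 128)  # stand-in for the predictor heads
    base_lr, epochs = 1e-3, 100      # assumed values, not taken from the paper

    optimizer = optim.Adam([
        {"params": encoder.parameters(), "lr": base_lr},         # cosine-scheduled
        {"params": predictor.parameters(), "lr": 10 * base_lr},  # fixed at 10x base
    ])
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=[
        lambda e: 0.5 * (1.0 + math.cos(math.pi * e / epochs)),  # cosine decay (encoders)
        lambda e: 1.0,                                           # constant (predictors)
    ])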