Audio-Visual Contrastive Learning with Temporal Self-Supervision

Authors: Simon Jenni, Alexander Black, John Collomosse

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our model design in extensive ablation experiments and evaluate the video and audio representations in transfer experiments to action recognition and retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, with state-of-the-art results.
Researcher Affiliation | Collaboration | Simon Jenni¹, Alexander Black², John Collomosse¹,² (¹Adobe Research, ²University of Surrey); jenni@adobe.com, alex.black@surrey.ac.uk, collomos@adobe.com
Pseudocode | No | The paper describes the components and equations of its model and loss functions but does not provide any pseudocode or algorithm blocks. (An illustrative, hedged sketch of a generic cross-modal contrastive loss is given after the table.)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | As a pre-training dataset we use Kinetics (Zisserman et al. 2017) in most of our experiments. [...] For transfer experiments we consider UCF101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011) [...] We evaluate the audio branch of our model on ESC50 (Piczak 2015) [...] Finally, we use the test set of VGG-Sound (Chen et al. 2020a).
Dataset Splits | No | The paper uses standard datasets like UCF101 and HMDB51, which typically have predefined splits. However, the paper does not explicitly state the train/validation/test split percentages or sample counts, nor does it cite the specific methodology for these splits within the text.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and the 'AugLy library' but does not provide specific version numbers for these or any other software dependencies crucial for reproducibility.
Experiment Setup | Yes | We train the models using the AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay set to 10^-4. The learning rate follows a cosine annealing schedule (Loshchilov and Hutter 2016) with a maximum learning rate of 3×10^-4 and linear warm-up in the first training epoch. By default, we train all the models with a batch size of 256. [...] The projection MLPs ψ contain two hidden layers of size 1024 and output feature embeddings of size 256. The prediction MLPs φ contain a single hidden layer with a hidden dimension of 1024. [...] If not specified otherwise, input video clips are assumed to contain 16 frames of resolution 112×112 for R(2+1)D, 128×128 for R3D-18, and 224×224 for R3D-34. Our audio encoder Fa is based on a standard ResNet-34 (He et al. 2016) architecture in all experiments. Input spectrograms to the audio encoder are resized to 224×224. (A hedged sketch of this configuration is given after the table.)
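
Because the paper states its losses only as equations and provides no pseudocode, the following is a minimal, generic sketch of a cross-modal InfoNCE contrastive loss between video and audio embeddings. It is not the authors' objective (which also includes temporal self-supervision terms); the symmetric formulation, L2 normalization, and temperature value are assumptions for illustration only.

```python
# Illustrative sketch only: generic cross-modal InfoNCE between video and audio
# embeddings. NOT the authors' loss; temperature and normalization are assumed.
import torch
import torch.nn.functional as F

def cross_modal_info_nce(video_emb: torch.Tensor,
                         audio_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """video_emb, audio_emb: (batch, dim) projected features of the two modalities."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: match each video clip to its audio track and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```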
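
The quoted experiment setup can also be summarized as a configuration sketch. Only the AdamW settings (weight decay 10^-4), the cosine schedule with a 3×10^-4 peak and one-epoch linear warm-up, the batch size of 256, and the projection/prediction head dimensions come from the quoted text; the layer ordering and activations inside the MLP heads, the encoder feature dimension (512), the number of epochs, and the warm-up implementation are assumptions.

```python
# Minimal sketch of the reported training configuration; values not quoted in the
# paper excerpt are marked as assumptions in the comments below.
import torch
import torch.nn as nn

def projection_mlp(in_dim: int, hidden: int = 1024, out_dim: int = 256) -> nn.Sequential:
    # "two hidden layers of size 1024 ... output feature embeddings of size 256"
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim),
    )

def prediction_mlp(dim: int = 256, hidden: int = 1024) -> nn.Sequential:
    # "a single hidden layer with a hidden dimension of 1024"
    return nn.Sequential(
        nn.Linear(dim, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, dim),
    )

model = nn.ModuleDict({
    "proj_video": projection_mlp(512),   # 512 = assumed encoder feature size
    "pred_video": prediction_mlp(),
})

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)

epochs, steps_per_epoch = 100, 1000      # assumed; not stated in the excerpt
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=steps_per_epoch)   # 1-epoch warm-up
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=(epochs - 1) * steps_per_epoch)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, [warmup, cosine], milestones=[steps_per_epoch])
# Call scheduler.step() once per training iteration so the warm-up spans epoch 1.
```

SequentialLR is used here simply to chain the one-epoch linear warm-up with cosine decay; the paper does not specify how the schedule was implemented, only its shape.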