EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

Authors: Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.
Researcher Affiliation | Academia | Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. Correspondence to: Jongsuk Kim <jskpop@kaist.ac.kr>.
Pseudocode | Yes | Algorithm 1 EquiAV
Open Source Code | Yes | The code is available at https://github.com/JongSuk1/EquiAV
Open Datasets | Yes | We utilize two prominent audio-visual datasets for our experiments: AudioSet (Gemmeke et al., 2017) and VGGSound (Chen et al., 2020a).
Dataset Splits | No | The paper does not explicitly provide validation dataset splits with percentages or counts for reproduction. It mentions "evaluation" clips for AudioSet and train/test splits for VGGSound, but no distinct validation split is specified.
Hardware Specification | Yes | GPUs: 8× A6000 (pre-training), 8× A5000 (fine-tuning)
Software Dependencies | No | The paper mentions software components and techniques such as the AdamW optimizer, half-cycle cosine annealing (Loshchilov & Hutter, 2017), Vision Transformer, MAE, and SpecAugment, but does not provide version numbers for any software dependency.
Experiment Setup | Yes | The hyperparameter settings used in this paper are listed in Table D, e.g., optimizer: AdamW with momentum β1 = 0.9, β2 = 0.95; weight decay: 1e-5; learning-rate scheduler: half-cycle cosine annealing (Loshchilov & Hutter, 2017); initial learning rate: 1e-6; peak learning rate: 1e-4; warm-up epochs: 2; epochs: 20; batch size: 256.
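As an illustration of the reported experiment setup, below is a minimal sketch (not the authors' released code) of how the listed pre-training hyperparameters could be wired into a PyTorch AdamW optimizer with linear warm-up followed by half-cycle cosine annealing. The model, the per-epoch training loop, and the steps-per-epoch are placeholders; only the numeric settings come from the paper's Table D.

```python
# Sketch only: hyperparameters from the paper's Table D; model and loop are placeholders.
import math
import torch

EPOCHS = 20
WARMUP_EPOCHS = 2
INIT_LR = 1e-6
PEAK_LR = 1e-4
WEIGHT_DECAY = 1e-5
BATCH_SIZE = 256  # reported batch size; actual steps per epoch depend on dataset size

model = torch.nn.Linear(8, 8)  # placeholder for the EquiAV audio/visual encoders

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.95),
    weight_decay=WEIGHT_DECAY,
)

def lr_at_epoch(epoch: int) -> float:
    """Linear warm-up from INIT_LR to PEAK_LR, then half-cycle cosine decay."""
    if epoch < WARMUP_EPOCHS:
        return INIT_LR + (PEAK_LR - INIT_LR) * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# LambdaLR expects a multiplier on the base lr (PEAK_LR here).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: lr_at_epoch(epoch) / PEAK_LR
)

for epoch in range(EPOCHS):
    # ... one pass over the pre-training data with batch size 256 goes here ...
    optimizer.step()   # placeholder for the actual per-batch updates
    scheduler.step()   # advance the epoch-level learning-rate schedule
```

This schedule reproduces the reported values at the endpoints (1e-6 at epoch 0, 1e-4 at the end of warm-up); how the authors step the schedule per batch versus per epoch is not specified in the excerpt, so the per-epoch stepping above is an assumption.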