Active Contrastive Learning of Audio-Visual Video Representations
Authors: Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51 and ESC50. |
| Researcher Affiliation | Collaboration | Shuang Ma (Microsoft, Redmond, WA, USA); Zhaoyang Zeng (Sun Yat-sen University, Guangzhou, China); Daniel McDuff (Microsoft Research, Redmond, WA, USA); Yale Song (Microsoft Research, Redmond, WA, USA) |
| Pseudocode | Yes | Algorithm 1 describes our proposed cross-modal active contrastive coding... Algorithm 2: Cross-Modal Active Contrastive Coding (detailed version of Algorithm 1)... Algorithm 3: k-means++ Seed Cluster Initialization... Algorithm 4: Cross-Modal Contrastive Coding without Active Sampling (a hedged sketch of the core contrastive step appears after this table) |
| Open Source Code | Yes | Code is available at: https://github.com/yunyikristy/CM-ACC |
| Open Datasets | Yes | When pretrained on Audio Set (Gemmeke et al., 2017), our approach achieves new state-of-the-art classification performance on UCF101 (Soomro et al., 2012), HMDB51 (Kuehne et al., 2011), and ESC50 (Piczak, 2015b). |
| Dataset Splits | Yes | UCF101 and HMDB51 have 3 official train/test splits, while ESC50 has 5 splits. We conduct our ablation study using split-1 of each dataset. We report our average performance over all splits when we compare with prior work. |
| Hardware Specification | Yes | We used 40 NVIDIA Tesla P100 GPUs for our experiments. |
| Software Dependencies | No | All models are trained end-to-end with the ADAM optimizer (Kingma & Ba, 2014) (No specific version numbers for Adam or other software dependencies are provided.) |
| Experiment Setup | Yes | All models are trained end-to-end with the ADAM optimizer (Kingma & Ba, 2014) with an initial learning rate γ = 10⁻³ after a warm-up period of 500 iterations. We use the mini-batch size M = 128, dictionary size K = 30×128, pool size N = 300×128, momentum m = 0.999, and temperature τ = 0.7. (A hedged configuration sketch follows below.) |
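
To make the Pseudocode row concrete, here is a minimal sketch of the cross-modal contrastive step, closest in spirit to Algorithm 4 (i.e., without active sampling). This is an illustration under stated assumptions, not the authors' implementation: the function names `cross_modal_nce` and `momentum_update` are hypothetical, and the MoCo-style momentum queue follows the paper's description of a dictionary of keys from the opposite modality.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(q_video, k_audio, queue_audio, tau=0.7):
    """InfoNCE: video queries against positive audio keys plus queued negatives.

    q_video:     (M, D) L2-normalized video query embeddings
    k_audio:     (M, D) L2-normalized audio key embeddings (positives)
    queue_audio: (K, D) L2-normalized audio negatives from the dictionary queue
    """
    l_pos = (q_video * k_audio).sum(dim=1, keepdim=True)    # (M, 1) positive logits
    l_neg = q_video @ queue_audio.t()                       # (M, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """MoCo-style update: key encoder drifts slowly toward the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

# Usage: embeddings would come from the video/audio encoders; randoms stand in here.
q_v = F.normalize(torch.randn(128, 128), dim=1)
k_a = F.normalize(torch.randn(128, 128), dim=1)
queue = F.normalize(torch.randn(30 * 128, 128), dim=1)
loss = cross_modal_nce(q_v, k_a, queue)
```

The paper's objective is symmetric, so the same loss would also be computed with audio queries against video keys; the active-sampling variant (Algorithms 1-3) additionally selects which negatives enter the dictionary rather than enqueueing them at random.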
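Likewise, a hedged sketch of the reported experiment setup. The hyperparameters (M = 128, K = 30×128, N = 300×128, m = 0.999, τ = 0.7, initial learning rate 10⁻³, 500-iteration warm-up) come from the table above; the linear warm-up shape, the placeholder model, and the dummy loss are assumptions made purely for illustration.

```python
import torch

M, TAU, MOM = 128, 0.7, 0.999       # mini-batch size, temperature, key-encoder momentum
K, N = 30 * 128, 300 * 128          # dictionary size and negative candidate pool size

model = torch.nn.Linear(512, 128)   # placeholder for the audio-visual encoder pair
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Linear warm-up over the first 500 iterations (the warm-up shape is assumed;
# the paper only states a 500-iteration warm-up to the initial rate of 1e-3).
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=500)

for step in range(600):             # stand-in training loop
    x = torch.randn(M, 512)
    loss = model(x).pow(2).mean()   # dummy loss standing in for the NCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    warmup.step()
```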