Contrastive Learning of Global and Local Video Representations
Authors: Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on various downstream tasks that need local spatio-temporal information, i.e., lip reading [22, 21, 3], deep-fake detection [25] and audio-visual event localization [76], and also discriminative tasks that need global information, i.e., audio/visual video classification [71, 47, 63, 41]. We show that the same pretrained model successfully generalizes to all our scenarios without having to re-pretrain it using different objectives and/or datasets. |
| Researcher Affiliation | Collaboration | Shuang Ma, Microsoft, Redmond, WA, USA; Zhaoyang Zeng, Sun Yat-sen University, Guangzhou, China; Daniel McDuff, Microsoft Research, Redmond, WA, USA; Yale Song, Microsoft Research, Redmond, WA, USA |
| Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | https://github.com/yunyikristy/global_local |
| Open Datasets | Yes | Therefore, we pretrain our model only once for all downstream tasks using a combination of Kinetics [14] and AVSpeech [26]. |
| Dataset Splits | Yes | For a fair comparison with SOTA, we use the standard data processing protocol of [91]. We follow the same data preprocessing protocol as in SOTA approaches for this task, and use the same training and test sets as [20]. For a fair comparison, we followed the same protocol and evaluation metric as [76]. |
| Hardware Specification | Yes | We use 16 NVIDIA Tesla P100 GPUs with a batch size of 32. |
| Software Dependencies | No | The paper mentions using librosa for mel-spectrogram extraction but does not provide a specific version number, nor does it list other software dependencies with version details. (A hedged extraction sketch follows the table.) |
| Experiment Setup | Yes | All models are trained end-to-end using ADAM [44] with an initial learning rate γ = 10⁻³ after a warm-up period of 500 iterations. We set the clip length to 32 frames (3 seconds) and resize frames to 112 × 112; we feed 8 frames and 32 frames to E_v^g and E_v^l, respectively. (See the optimizer and input-shape sketches after the table.) |
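The paper names librosa for mel-spectrogram extraction but pins no version or parameters. A minimal sketch of what such an extraction step typically looks like; the sample rate and mel-band count below are illustrative assumptions, not values from the paper:

```python
# Illustrative mel-spectrogram extraction with librosa. The sr and n_mels
# defaults are assumptions; the paper specifies neither them nor a version.
import librosa
import numpy as np

def extract_mel(path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Load audio and return a log-scaled mel-spectrogram."""
    waveform, sr = librosa.load(path, sr=sr)  # resample to the target rate
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log scale, standard for audio encoders
```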
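The experiment-setup quote reports Adam with an initial learning rate of 10⁻³ after a 500-iteration warm-up. A sketch of that schedule in PyTorch, assuming a linear warm-up ramp (the paper states only the rate and warm-up length, not the ramp shape); the model and loss here are placeholders:

```python
# Sketch of the reported optimization setup: Adam at lr 1e-3 with a
# 500-iteration warm-up. Linear ramping is an assumption.
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the actual video model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_iters = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters),  # ramp up to 1e-3
)

for it in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 512)).pow(2).mean()  # dummy loss; batch size 32
    loss.backward()
    optimizer.step()
    scheduler.step()
```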
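The same quote fixes the input shapes: 32-frame clips (3 s) at 112 × 112, with 8 frames routed to the global encoder E_v^g and all 32 to the local encoder E_v^l. A shape-level sketch of that split; uniform stride-4 subsampling is an assumption, as the paper does not say how the 8 frames are chosen:

```python
# Sketch of the reported input shapes. Batch size 4 is for illustration
# only; the paper trains with batch size 32. Stride-4 frame selection for
# the global encoder is an assumption.
import torch

clip = torch.randn(4, 3, 32, 112, 112)  # (batch, channels, frames, H, W)

local_input = clip              # local encoder E_v^l sees all 32 frames
global_input = clip[:, :, ::4]  # global encoder E_v^g sees 8 spaced frames
assert global_input.shape[2] == 8
```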