Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Contrastive Learning of Global and Local Video Representations
Authors: Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on various downstream tasks that need local spatio-temporal information, i.e., lip reading [22, 21, 3], deep-fake detection [25] and audio-visual event localization [76], and also discriminative tasks that needs global information, i.e., audio/visual video classification [71, 47, 63, 41]. We show that the same pretrained model successfully generalizes to all our scenarios without having to re-pretrain it using different objectives and/or datasets. |
| Researcher Affiliation | Collaboration | Shuang Ma, Microsoft, Redmond, WA, USA; Zhaoyang Zeng, Sun Yat-sen University, Guangzhou, China; Daniel McDuff, Microsoft Research, Redmond, WA, USA; Yale Song, Microsoft Research, Redmond, WA, USA |
| Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | https://github.com/yunyikristy/global_local |
| Open Datasets | Yes | Therefore, we pretrain our model only once for all downstream tasks using a combination of Kinetics [14] and AVSpeech [26]. |
| Dataset Splits | Yes | For a fair comparison with SOTA, we use the standard data processing protocol of [91]. We follow the same data preprocessing protocol as in SOTA approaches for this task, and use the same training and test sets as [20]. For a fair comparison, we followed the same protocol and evaluation metric as [76]. |
| Hardware Specification | Yes | We use 16 NVIDIA Tesla P100 GPUs with a batch size of 32. |
| Software Dependencies | No | The paper mentions using 'LibROSA' for mel-spectrogram extraction but does not provide a specific version number, nor does it list other software dependencies with version details. |
| Experiment Setup | Yes | All models are trained end-to-end using ADAM [44] with an initial learning rate γ = 10⁻³ after a warm-up period of 500 iterations. We set the clip length to 32 frames (3 seconds) and resize frames to 112 × 112; we feed 8 frames and 32 frames to E^g_v and E^l_v, respectively. |
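
The learning-rate schedule in the Experiment Setup row can be illustrated with a minimal sketch. It makes several assumptions that the row does not state: the warm-up is taken to be linear, the learning rate is read as 10⁻³, and the rate is held constant after warm-up because no later decay is described. The names `BASE_LR`, `WARMUP_ITERS`, and `learning_rate` are hypothetical and do not come from the paper or its repository.

```python
# Hypothetical sketch of a warm-up schedule; not the authors' implementation.
# Assumptions: linear warm-up shape, learning rate read as 1e-3,
# and a constant rate afterwards (the source describes no decay).
BASE_LR = 1e-3        # initial learning rate reached after warm-up
WARMUP_ITERS = 500    # warm-up length in iterations


def learning_rate(iteration: int) -> float:
    """Return the learning rate for a 0-indexed training iteration."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < WARMUP_ITERS:
        # Ramp linearly from BASE_LR / WARMUP_ITERS up to BASE_LR.
        return BASE_LR * (iteration + 1) / WARMUP_ITERS
    return BASE_LR
```

In a training loop, a function like this would typically be called once per iteration and its value written to the optimizer's learning-rate setting before the update step.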