Contrastive Learning of Global and Local Video Representations

Authors: Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on various downstream tasks that need local spatio-temporal information, i.e., lip reading [22, 21, 3], deep-fake detection [25], and audio-visual event localization [76], and also discriminative tasks that need global information, i.e., audio/visual video classification [71, 47, 63, 41]. We show that the same pretrained model successfully generalizes to all our scenarios without having to re-pretrain it using different objectives and/or datasets.
Researcher Affiliation | Collaboration | Shuang Ma, Microsoft, Redmond, WA, USA; Zhaoyang Zeng, Sun Yat-sen University, Guangzhou, China; Daniel McDuff, Microsoft Research, Redmond, WA, USA; Yale Song, Microsoft Research, Redmond, WA, USA
Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | https://github.com/yunyikristy/global_local
Open Datasets | Yes | Therefore, we pretrain our model only once for all downstream tasks using a combination of Kinetics [14] and AVSpeech [26]. (See the dataset-combination sketch after the table.)
Dataset Splits | Yes | For a fair comparison with SOTA, we use the standard data processing protocol of [91]. We follow the same data preprocessing protocol as in SOTA approaches for this task, and use the same training and test sets as [20]. For a fair comparison, we followed the same protocol and evaluation metric as [76].
Hardware Specification | Yes | We use 16 NVIDIA Tesla P100 GPUs with a batch size of 32. (See the multi-GPU sketch after the table.)
Software Dependencies | No | The paper mentions using LibROSA for mel-spectrogram extraction but does not provide a specific version number, nor does it list other software dependencies with version details. (See the mel-spectrogram sketch after the table.)
Experiment Setup | Yes | All models are trained end-to-end using ADAM [44] with an initial learning rate γ = 10^-3 after a warm-up period of 500 iterations. We set the clip length to 32 frames (3 seconds) and resize frames to 112 × 112; we feed 8 frames and 32 frames to E^g_v and E^l_v, respectively. (See the training-schedule sketch after the table.)
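
Dataset-combination sketch. A minimal sketch of how the reported Kinetics + AVSpeech pretraining mix could be assembled with PyTorch's ConcatDataset. The ClipDataset class, paths, tensor shapes, and loader settings below are placeholders and assumptions, not the authors' code.

import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class ClipDataset(Dataset):
    """Placeholder audio-visual clip dataset (hypothetical, not the authors' loader)."""
    def __init__(self, clip_paths):
        self.clip_paths = clip_paths

    def __len__(self):
        return len(self.clip_paths)

    def __getitem__(self, idx):
        # A real loader would decode a 3-second clip: 32 RGB frames at 112x112
        # plus the matching audio. Dummy tensors keep this sketch runnable.
        video = torch.zeros(3, 32, 112, 112)
        audio = torch.zeros(1, 48000)  # ~3 s of audio at an assumed 16 kHz
        return video, audio

kinetics = ClipDataset(clip_paths=["..."])   # paths to Kinetics clips (placeholder)
avspeech = ClipDataset(clip_paths=["..."])   # paths to AVSpeech clips (placeholder)
pretrain_set = ConcatDataset([kinetics, avspeech])
loader = DataLoader(pretrain_set, batch_size=32, shuffle=True)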
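
Multi-GPU sketch. For the reported hardware (16 NVIDIA Tesla P100 GPUs, batch size 32), a distributed setup along these lines would be typical. The use of DistributedDataParallel, the launch command, and the per-GPU batch split are assumptions; the table only states the GPU count and batch size.

# Launch with, e.g.: torchrun --nnodes=2 --nproc_per_node=8 pretrain.py
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(512, 128).cuda(local_rank)   # stand-in for the real encoders
model = DDP(model, device_ids=[local_rank])

# If 32 is the global batch, each of the 16 processes sees 2 samples;
# if it is per-GPU, drop the division. (Assumption, not stated in the paper.)
per_gpu_batch = 32 // dist.get_world_size()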
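
Mel-spectrogram sketch. Since the paper names LibROSA for mel-spectrogram extraction without pinning a version, extraction might look as follows; the sample rate, n_fft, hop_length, and n_mels values are illustrative assumptions, not the paper's settings.

import librosa
import numpy as np

# Path is a placeholder; sr=16000 is an assumed target sample rate.
y, sr = librosa.load("clip_audio.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)   # log-scaled mel spectrogram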
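
Training-schedule sketch. A minimal sketch of the reported optimization schedule (Adam, initial learning rate 1e-3 reached after a 500-iteration warm-up, 32-frame 112 × 112 clips, 8 frames to the global video encoder). The linear warm-up shape, the uniform frame subsampling, and the stand-in model and loss are assumptions; only the numbers come from the table.

import torch
from torch import nn, optim

encoder = nn.Conv3d(3, 8, kernel_size=3, padding=1)    # stand-in for the real video encoders
optimizer = optim.Adam(encoder.parameters(), lr=1e-3)  # initial learning rate 1e-3 (from the table)
warmup_iters = 500

# Linear ramp to the full learning rate over the first 500 iterations,
# constant afterwards; the ramp shape itself is an assumption.
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))

for it in range(warmup_iters + 1):
    clip = torch.randn(2, 3, 32, 112, 112)   # 32 frames (3 s), resized to 112 x 112
    global_view = clip[:, :, ::4]            # 8 frames for E^g_v (assumed uniform subsampling)
    # Placeholder loss, not the paper's contrastive objective.
    loss = encoder(clip).mean() + encoder(global_view).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()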