Contrastive Learning of Global and Local Video Representations

Authors: Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on various downstream tasks that need local spatio-temporal information, i.e., lip reading [22, 21, 3], deep-fake detection [25], and audio-visual event localization [76], and also discriminative tasks that need global information, i.e., audio/visual video classification [71, 47, 63, 41]. We show that the same pretrained model successfully generalizes to all our scenarios without having to re-pretrain it using different objectives and/or datasets.
Researcher Affiliation | Collaboration | Shuang Ma, Microsoft, Redmond, WA, USA; Zhaoyang Zeng, Sun Yat-sen University, Guangzhou, China; Daniel McDuff, Microsoft Research, Redmond, WA, USA; Yale Song, Microsoft Research, Redmond, WA, USA
Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | https://github.com/yunyikristy/global_local
Open Datasets | Yes | Therefore, we pretrain our model only once for all downstream tasks using a combination of Kinetics [14] and AVSpeech [26]. (See the dataset-combination sketch after the table.)
Dataset Splits | Yes | For a fair comparison with SOTA, we use the standard data processing protocol of [91]. We follow the same data preprocessing protocol as in SOTA approaches for this task, and use the same training and test sets as [20]. For a fair comparison, we followed the same protocol and evaluation metric as [76].
Hardware Specification | Yes | We use 16 NVIDIA Tesla P100 GPUs with a batch size of 32. (See the multi-GPU sketch after the table.)
Software Dependencies | No | The paper mentions using LibROSA for mel-spectrogram extraction but does not provide a specific version number, nor does it list other software dependencies with version details. (See the mel-spectrogram sketch after the table.)
Experiment Setup | Yes | All models are trained end-to-end using ADAM [44] with an initial learning rate γ = 10^-3 after a warm-up period of 500 iterations. We set the clip length to 32 frames (3 seconds) and resize frames to 112 × 112; we feed 8 frames and 32 frames to E^g_v and E^l_v, respectively. (See the training-schedule sketch after the table.)
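
Dataset-combination sketch. A minimal sketch of how the reported Kinetics + AVSpeech pretraining mix could be assembled with PyTorch's ConcatDataset. The ClipDataset class, paths, tensor shapes, and loader settings below are placeholders and assumptions, not the authors' code.

import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class ClipDataset(Dataset):
    """Placeholder audio-visual clip dataset (hypothetical, not the authors' loader)."""
    def __init__(self, clip_paths):
        self.clip_paths = clip_paths

    def __len__(self):
        return len(self.clip_paths)

    def __getitem__(self, idx):
        # A real loader would decode a 3-second clip: 32 RGB frames at 112x112
        # plus the matching audio. Dummy tensors keep this sketch runnable.
        video = torch.zeros(3, 32, 112, 112)
        audio = torch.zeros(1, 48000)  # ~3 s of audio at an assumed 16 kHz
        return video, audio

kinetics = ClipDataset(clip_paths=["..."])   # paths to Kinetics clips (placeholder)
avspeech = ClipDataset(clip_paths=["..."])   # paths to AVSpeech clips (placeholder)
pretrain_set = ConcatDataset([kinetics, avspeech])
loader = DataLoader(pretrain_set, batch_size=32, shuffle=True)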
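
Multi-GPU sketch. For the reported hardware (16 NVIDIA Tesla P100 GPUs, batch size 32), a distributed setup along these lines would be typical. The use of DistributedDataParallel, the launch command, and the per-GPU batch split are assumptions; the table only states the GPU count and batch size.

# Launch with, e.g.: torchrun --nnodes=2 --nproc_per_node=8 pretrain.py
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(512, 128).cuda(local_rank)   # stand-in for the real encoders
model = DDP(model, device_ids=[local_rank])

# If 32 is the global batch, each of the 16 processes sees 2 samples;
# if it is per-GPU, drop the division. (Assumption, not stated in the paper.)
per_gpu_batch = 32 // dist.get_world_size()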
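
Mel-spectrogram sketch. Since the paper names LibROSA for mel-spectrogram extraction without pinning a version, extraction might look as follows; the sample rate, n_fft, hop_length, and n_mels values are illustrative assumptions, not the paper's settings.

import librosa
import numpy as np

# Path is a placeholder; sr=16000 is an assumed target sample rate.
y, sr = librosa.load("clip_audio.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)   # log-scaled mel spectrogram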
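
Training-schedule sketch. A minimal sketch of the reported optimization schedule (Adam, initial learning rate 1e-3 reached after a 500-iteration warm-up, 32-frame 112 × 112 clips, 8 frames to the global video encoder). The linear warm-up shape, the uniform frame subsampling, and the stand-in model and loss are assumptions; only the numbers come from the table.

import torch
from torch import nn, optim

encoder = nn.Conv3d(3, 8, kernel_size=3, padding=1)    # stand-in for the real video encoders
optimizer = optim.Adam(encoder.parameters(), lr=1e-3)  # initial learning rate 1e-3 (from the table)
warmup_iters = 500

# Linear ramp to the full learning rate over the first 500 iterations,
# constant afterwards; the ramp shape itself is an assumption.
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))

for it in range(warmup_iters + 1):
    clip = torch.randn(2, 3, 32, 112, 112)   # 32 frames (3 s), resized to 112 x 112
    global_view = clip[:, :, ::4]            # 8 frames for E^g_v (assumed uniform subsampling)
    # Placeholder loss, not the paper's contrastive objective.
    loss = encoder(clip).mean() + encoder(global_view).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()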