Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Authors: Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches. Specifically, S-ViLM surpasses state-of-the-art methods substantially on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.
Researcher Affiliation | Collaboration | Yuanhao Xiong (1,3), Long Zhao (1), Boqing Gong (1), Ming-Hsuan Yang (1), Florian Schroff (1), Ting Liu (1), Cho-Jui Hsieh (2,3), Liangzhe Yuan (1); affiliations: 1 Google Research, 2 Google, 3 UCLA
Pseudocode | No | The paper describes its methods and components, such as the grouping block structure in Figure 3, but provides no explicit pseudocode or algorithm blocks with numbered steps. (An illustrative sketch of what such a grouping block might look like is given after this table.)
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described methodology and no link to a code repository.
Open Datasets | Yes | We pre-train S-ViLM with the VideoCC (Nagrani et al., 2022) dataset, which contains about 3.3M video-caption pairs. We also include ActivityNet-Caption (Krishna et al., 2017), with 20K well-aligned pairs, in the pre-training corpus. We adopt the widely used text-video retrieval benchmark MSR-VTT (Xu et al., 2016) for evaluation. We consider open-ended VQA settings with two representative datasets: (1) MSRVTT-QA (Xu et al., 2017) and (2) MSVD-QA (Xu et al., 2017). We select HMDB51 (Kuehne et al., 2011), containing 6,766 videos in 51 categories, and UCF101 (Soomro et al., 2012), containing 13,320 videos in 101 categories.
Dataset Splits | Yes | For the fine-tuning setup, we follow Bain et al. (2021) and Ge et al. (2022a), and train and test the model on the split of 9K and 1K videos.
Hardware Specification | Yes | We implement S-ViLM in JAX and train all models on TPU accelerators.
Software Dependencies | No | The paper mentions software such as JAX and spaCy but does not provide specific version numbers for any of its software dependencies.
Experiment Setup | Yes | During pre-training, SGD with momentum 0.9 and an initial learning rate of 0.1 is used for optimization. We train S-ViLM for 10 epochs with a batch size of 1024 and adopt a cosine learning-rate decay schedule with a warmup ratio of 0.05. The whole pre-training stage takes about one day. For fine-tuning, each task is trained independently with its own set of hyperparameters on the target dataset; more details can be found in Appendix A. For example, Table 8 lists SGD as the optimizer, 2.5e-1 as the base learning rate, 0.9 as the optimizer momentum, cosine decay as the learning-rate schedule, 512 as the batch size, 0.1 as the warmup ratio, and 20 training epochs for MSR-VTT. (A sketch of the pre-training optimizer setup appears below.)
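
Since the paper provides no pseudocode for the grouping block shown in its Figure 3, the following is a minimal, illustrative sketch of how a GroupViT-style grouping block is typically implemented: learnable group tokens cross-attend to video tokens, and a Gumbel-softmax produces a differentiable, near-hard assignment of tokens to groups. All names, shapes, and the exact update rule here are assumptions made for illustration, not the authors' implementation.

# Minimal sketch of a GroupViT-style grouping block (hypothetical names/shapes).
import jax
import jax.numpy as jnp

def grouping_block(group_tokens, video_tokens, w_q, w_k, w_v, rng, tau=1.0):
    # group_tokens: (G, D) learnable group queries; video_tokens: (N, D) features.
    q = group_tokens @ w_q                         # (G, D) query projection
    k = video_tokens @ w_k                         # (N, D) key projection
    v = video_tokens @ w_v                         # (N, D) value projection
    logits = q @ k.T / jnp.sqrt(q.shape[-1])       # (G, N) group-token similarity

    # Gumbel-softmax over the group axis: a differentiable, near-hard
    # assignment of each video token to one group.
    u = jax.random.uniform(rng, logits.shape, minval=1e-9, maxval=1.0)
    gumbel = -jnp.log(-jnp.log(u))
    assign = jax.nn.softmax((logits + gumbel) / tau, axis=0)  # (G, N)

    # Average the assigned token features into each group, with a residual.
    grouped = (assign @ v) / (assign.sum(axis=1, keepdims=True) + 1e-6)
    return group_tokens + grouped                  # (G, D) updated group tokens

# Toy usage with random parameters.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5, k6 = jax.random.split(key, 6)
D, G, N = 64, 8, 49
w_q, w_k, w_v = (0.02 * jax.random.normal(k, (D, D)) for k in (k1, k2, k3))
groups = jax.random.normal(k4, (G, D))
tokens = jax.random.normal(k5, (N, D))
print(grouping_block(groups, tokens, w_q, w_k, w_v, k6).shape)  # (8, 64)

The Gumbel-softmax over the group axis lets each video token commit (softly) to a single group while keeping the assignment differentiable, which is the usual motivation for this block design.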
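
For the pre-training setup quoted above (SGD with momentum 0.9, initial learning rate 0.1, cosine decay, warmup ratio 0.05, 10 epochs, batch size 1024), a minimal sketch in JAX/optax might look as follows. The step counts are approximated from the reported ~3.3M VideoCC pairs, and the optax calls are an assumption rather than the authors' code.

# Minimal sketch of the reported pre-training optimizer in optax.
import optax

# Reported in the paper: batch size 1024, 10 epochs, warmup ratio 0.05,
# initial (peak) learning rate 0.1, SGD with momentum 0.9.
# Assumption: ~3.3M VideoCC pairs approximate the step counts;
# the 20K ActivityNet-Caption pairs are ignored here for simplicity.
num_examples = 3_300_000
batch_size = 1024
epochs = 10
steps_per_epoch = num_examples // batch_size
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.05 * total_steps)

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,           # ramp up from zero during warmup
    peak_value=0.1,           # "initial learning rate 0.1"
    warmup_steps=warmup_steps,
    decay_steps=total_steps,  # cosine decay over the full run
)
optimizer = optax.sgd(learning_rate=schedule, momentum=0.9)
# In a standard optax loop: state = optimizer.init(params);
# updates, state = optimizer.update(grads, state);
# params = optax.apply_updates(params, updates).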