Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Authors: Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches. Specifically, S-ViLM surpasses state-of-the-art methods substantially on four representative downstream tasks: text-video retrieval, video question answering, video action recognition, and temporal action localization.
Researcher Affiliation | Collaboration | Yuanhao Xiong (1,3), Long Zhao (1), Boqing Gong (1), Ming-Hsuan Yang (1), Florian Schroff (1), Ting Liu (1), Cho-Jui Hsieh (2,3), Liangzhe Yuan (1); affiliations: 1 Google Research, 2 Google, 3 UCLA
Pseudocode | No | The paper describes its methods and components, such as the grouping block structure in Figure 3, but provides no explicit pseudocode or algorithm blocks with numbered steps. (An illustrative sketch of what such a grouping block might look like is given after this table.)
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described methodology and no link to a code repository.
Open Datasets | Yes | We pre-train S-ViLM with the VideoCC (Nagrani et al., 2022) dataset, which contains about 3.3M video-caption pairs. We also include ActivityNet-Caption (Krishna et al., 2017), with 20K well-aligned pairs, in the pre-training corpus. We adopt the widely used text-video retrieval benchmark MSR-VTT (Xu et al., 2016) for evaluation. We consider open-ended VQA settings with two representative datasets: (1) MSRVTT-QA (Xu et al., 2017) and (2) MSVD-QA (Xu et al., 2017). We select HMDB51 (Kuehne et al., 2011), containing 6,766 videos in 51 categories, and UCF101 (Soomro et al., 2012), containing 13,320 videos in 101 categories.
Dataset Splits | Yes | For the fine-tuning setup, we follow Bain et al. (2021) and Ge et al. (2022a), and train and test the model on the split of 9K and 1K videos.
Hardware Specification | Yes | We implement S-ViLM in JAX and train all models on TPU accelerators.
Software Dependencies | No | The paper mentions software such as JAX and spaCy but does not provide specific version numbers for any of its software dependencies.
Experiment Setup | Yes | During pre-training, SGD with momentum 0.9 and an initial learning rate of 0.1 is used for optimization. We train S-ViLM for 10 epochs with a batch size of 1024 and adopt a cosine learning-rate decay schedule with a warmup ratio of 0.05. The whole pre-training stage takes about one day. For fine-tuning, each task is trained independently with its own set of hyperparameters on the target dataset; more details can be found in Appendix A. For example, Table 8 lists SGD as the optimizer, 2.5e-1 as the base learning rate, 0.9 as the optimizer momentum, cosine decay as the learning-rate schedule, 512 as the batch size, 0.1 as the warmup ratio, and 20 training epochs for MSR-VTT. (A sketch of the pre-training optimizer setup appears below.)
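
Since the paper provides no pseudocode for the grouping block shown in its Figure 3, the following is a minimal, illustrative sketch of how a GroupViT-style grouping block is typically implemented: learnable group tokens cross-attend to video tokens, and a Gumbel-softmax produces a differentiable, near-hard assignment of tokens to groups. All names, shapes, and the exact update rule here are assumptions made for illustration, not the authors' implementation.

# Minimal sketch of a GroupViT-style grouping block (hypothetical names/shapes).
import jax
import jax.numpy as jnp

def grouping_block(group_tokens, video_tokens, w_q, w_k, w_v, rng, tau=1.0):
    # group_tokens: (G, D) learnable group queries; video_tokens: (N, D) features.
    q = group_tokens @ w_q                         # (G, D) query projection
    k = video_tokens @ w_k                         # (N, D) key projection
    v = video_tokens @ w_v                         # (N, D) value projection
    logits = q @ k.T / jnp.sqrt(q.shape[-1])       # (G, N) group-token similarity

    # Gumbel-softmax over the group axis: a differentiable, near-hard
    # assignment of each video token to one group.
    u = jax.random.uniform(rng, logits.shape, minval=1e-9, maxval=1.0)
    gumbel = -jnp.log(-jnp.log(u))
    assign = jax.nn.softmax((logits + gumbel) / tau, axis=0)  # (G, N)

    # Average the assigned token features into each group, with a residual.
    grouped = (assign @ v) / (assign.sum(axis=1, keepdims=True) + 1e-6)
    return group_tokens + grouped                  # (G, D) updated group tokens

# Toy usage with random parameters.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5, k6 = jax.random.split(key, 6)
D, G, N = 64, 8, 49
w_q, w_k, w_v = (0.02 * jax.random.normal(k, (D, D)) for k in (k1, k2, k3))
groups = jax.random.normal(k4, (G, D))
tokens = jax.random.normal(k5, (N, D))
print(grouping_block(groups, tokens, w_q, w_k, w_v, k6).shape)  # (8, 64)

The Gumbel-softmax over the group axis lets each video token commit (softly) to a single group while keeping the assignment differentiable, which is the usual motivation for this block design.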
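
For the pre-training setup quoted above (SGD with momentum 0.9, initial learning rate 0.1, cosine decay, warmup ratio 0.05, 10 epochs, batch size 1024), a minimal sketch in JAX/optax might look as follows. The step counts are approximated from the reported ~3.3M VideoCC pairs, and the optax calls are an assumption rather than the authors' code.

# Minimal sketch of the reported pre-training optimizer in optax.
import optax

# Reported in the paper: batch size 1024, 10 epochs, warmup ratio 0.05,
# initial (peak) learning rate 0.1, SGD with momentum 0.9.
# Assumption: ~3.3M VideoCC pairs approximate the step counts;
# the 20K ActivityNet-Caption pairs are ignored here for simplicity.
num_examples = 3_300_000
batch_size = 1024
epochs = 10
steps_per_epoch = num_examples // batch_size
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.05 * total_steps)

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,           # ramp up from zero during warmup
    peak_value=0.1,           # "initial learning rate 0.1"
    warmup_steps=warmup_steps,
    decay_steps=total_steps,  # cosine decay over the full run
)
optimizer = optax.sgd(learning_rate=schedule, momentum=0.9)
# In a standard optax loop: state = optimizer.init(params);
# updates, state = optimizer.update(grads, state);
# params = optax.apply_updates(params, updates).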