Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Authors: Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches. Specifically, S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks, covering text-video retrieval, video question answering, video action recognition, and temporal action localization. |
| Researcher Affiliation | Collaboration | Yuanhao Xiong (1,3), Long Zhao (1), Boqing Gong (1), Ming-Hsuan Yang (1), Florian Schroff (1), Ting Liu (1), Cho-Jui Hsieh (2,3), Liangzhe Yuan (1); (1) Google Research, (2) Google, (3) UCLA |
| Pseudocode | No | The paper describes its methods and components, such as the grouping block structure in Figure 3, but does not provide explicit pseudocode or algorithm blocks with numbered steps. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We pre-train S-ViLM with the VideoCC (Nagrani et al., 2022) dataset, which contains about 3.3M video-caption pairs. We also include ActivityNet-Caption (Krishna et al., 2017) with 20K well-aligned pairs into the pre-training corpus. We adopt the widely used text-video retrieval benchmark MSR-VTT (Xu et al., 2016) for evaluation. We consider open-ended VQA settings with two representative datasets: (1) MSRVTT-QA (Xu et al., 2017) and (2) MSVD-QA (Xu et al., 2017). We select HMDB51 (Kuehne et al., 2011) containing 6,766 videos with 51 categories and UCF101 (Soomro et al., 2012) containing 13,320 videos with 101 categories. |
| Dataset Splits | Yes | For the fine-tuning setup, we follow Bain et al. (2021) and Ge et al. (2022a), and train and test the model on splits of 9K and 1K videos, respectively. |
| Hardware Specification | Yes | We implement S-ViLM in JAX and train all models on TPU accelerators. |
| Software Dependencies | No | The paper mentions software like JAX and spaCy but does not provide specific version numbers for any of its software dependencies. |
| Experiment Setup | Yes | During pre-training, SGD with momentum 0.9 and an initial learning rate of 0.1 is used for optimization. We train S-ViLM for 10 epochs with a batch size of 1024 and adopt a cosine learning rate decay schedule with a warmup ratio of 0.05. It takes about one day for the whole pre-training stage. In terms of fine-tuning, different tasks are trained independently with their own set of hyperparameters on the target dataset; more details can be found in Appendix A. For example, Table 8 lists 'SGD' as the optimizer, '2.5e-1' as the base learning rate, '0.9' as the optimizer momentum, 'cosine decay' as the learning rate schedule, '512' as the batch size, '0.1' as the warmup ratio, and '20' as the number of training epochs for MSR-VTT. |
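
The reported pre-training optimization setup maps directly onto a standard JAX training configuration. Below is a minimal sketch assuming the optax library (the paper states only that it uses JAX, not a specific optimizer library); the corpus-size arithmetic for deriving step counts is likewise an assumption based on the dataset figures quoted above.

```python
# Minimal sketch of the reported pre-training optimization setup in JAX/optax.
# The hyperparameter values come from the paper; the use of optax and the
# derivation of step counts from the corpus size are assumptions.
import optax

# Hyperparameters reported in the experiment setup.
BATCH_SIZE = 1024
EPOCHS = 10
WARMUP_RATIO = 0.05
PEAK_LR = 0.1
MOMENTUM = 0.9

# Assumed corpus size: ~3.3M VideoCC pairs plus 20K ActivityNet-Caption pairs.
NUM_EXAMPLES = 3_300_000 + 20_000
steps_per_epoch = NUM_EXAMPLES // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_RATIO * total_steps)

# Cosine decay schedule with linear warmup, as described in the paper.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=PEAK_LR,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
)

# SGD with momentum 0.9 driven by the schedule above.
optimizer = optax.sgd(learning_rate=schedule, momentum=MOMENTUM)
```

The per-task fine-tuning configurations in Appendix A would plug into the same pattern; for MSR-VTT, Table 8's values correspond to a peak learning rate of 2.5e-1, a batch size of 512, a warmup ratio of 0.1, and 20 training epochs.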