Stitching Segments and Sentences towards Generalization in Video-Text Pre-training
Authors: Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Yi Yang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval. Our results demonstrate that the proposed method significantly improves the generalization capacity of the video-text pretraining models. |
| Researcher Affiliation | Collaboration | Fan Ma¹*, Xiaojie Jin², Heng Wang², Jingjia Huang², Linchao Zhu¹, Yi Yang¹. ¹Zhejiang University, ²Bytedance Inc. mafan@zju.edu.cn, {jinxiaojie, heng.wang, huangjingjia}@bytedance.com, {zhulinchao, yangyics}@zju.edu.cn |
| Pseudocode | No | The paper presents methods using text descriptions and mathematical equations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | Following recent work (Huang et al. 2023), we use the WebVid (Bain et al. 2021b) and the Google Conceptual Captions (Sharma et al. 2018) as the training data. |
| Dataset Splits | No | The paper mentions the datasets used for training and evaluation but does not explicitly provide the training/validation/test splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | Yes | We pre-train our model for 40 epochs, using a batch size of 2048 on 64 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like 'Video Swin' and 'BERT-base model' but does not provide specific version numbers for underlying software frameworks or libraries (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We pre-train our model for 40 epochs, using a batch size of 2048 on 64 NVIDIA V100 GPUs. We use the AdamW (Loshchilov and Hutter 2019) optimizer with a weight decay of 0.005 and betas (0.9, 0.98). The learning rate is first set to 5e-5 and then decays by 10 times following a cosine annealing schedule. All video frames are resized to 224×224, and 8 frames are randomly sampled from each video while the temporal order is preserved. During pre-training, all words in the sentence are randomly masked with 15% probability to enable masked language modeling in both normal and causal attentions. |
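
As a reference for the Experiment Setup row, below is a minimal PyTorch sketch of the reported pretraining configuration (optimizer, learning-rate schedule, order-preserving frame sampling, and 15% token masking). The paper does not state its framework or versions, so the use of PyTorch, the placeholder `model`, the `sample_frames`/`mask_tokens` helpers, and the cosine floor of 5e-6 (reading "decays by 10 times" as annealing to one tenth of the initial rate) are all assumptions for illustration, not the authors' implementation.

```python
import torch

# Stand-in module; the paper pairs a Video Swin backbone with BERT-base.
model = torch.nn.Linear(768, 768)

# Optimizer as reported: AdamW, lr 5e-5, weight decay 0.005, betas (0.9, 0.98).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-5, weight_decay=0.005, betas=(0.9, 0.98)
)

# "Decays by 10 times following a cosine annealing schedule" is read here
# (assumption) as annealing from 5e-5 down to 5e-6 over the 40 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=40, eta_min=5e-6
)


def sample_frames(num_frames: int, k: int = 8) -> torch.Tensor:
    """Randomly pick k frame indices while preserving temporal order."""
    idx = torch.randperm(num_frames)[:k]
    return torch.sort(idx).values


def mask_tokens(token_ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """Mask each token independently with probability p (simplified MLM)."""
    mask = torch.rand(token_ids.shape) < p
    masked = token_ids.clone()
    masked[mask] = mask_id
    return masked, mask


# Example: an order-preserving 8-frame sample from a 64-frame clip, and 15%
# masking of a toy token sequence (103 is BERT's [MASK] id, used illustratively).
frame_idx = sample_frames(64)
tokens, mlm_mask = mask_tokens(torch.randint(5, 1000, (16,)), mask_id=103)
```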