Stitching Segments and Sentences towards Generalization in Video-Text Pre-training

Authors: Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Yi Yang

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on various benchmarks covering text-to-video retrieval, video question answering, video captioning, and moment retrieval. Our results demonstrate that the proposed method significantly improves the generalization capacity of the video-text pretraining models.
Researcher Affiliation Collaboration Fan Ma¹*, Xiaojie Jin², Heng Wang², Jingjia Huang², Linchao Zhu¹, Yi Yang¹ (¹Zhejiang University, ²Bytedance Inc.); mafan@zju.edu.cn, {jinxiaojie, heng.wang, huangjingjia}@bytedance.com, {zhulinchao, yangyics}@zju.edu.cn
Pseudocode No The paper presents methods using text descriptions and mathematical equations but does not include any pseudocode or algorithm blocks.
Open Source Code No The paper does not contain any statements or links indicating that source code for the described methodology is publicly available.
Open Datasets Yes Following recent work (Huang et al. 2023), we use the WebVid (Bain et al. 2021b) and the Google Conceptual Captions (Sharma et al. 2018) datasets as the training data.
Dataset Splits No The paper mentions various datasets used for training and evaluation but does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction.
Hardware Specification Yes We pre-train our model for 40 epochs, using a batch size of 2048 on 64 NVIDIA V100 GPUs.
Software Dependencies No The paper mentions using specific models like 'Video Swin' and 'BERT-base model' but does not provide specific version numbers for underlying software frameworks or libraries (e.g., PyTorch version, Python version).
Experiment Setup Yes We pre-train our model for 40 epochs, using a batch size of 2048 on 64 NVIDIA V100 GPUs. We use the AdamW (Loshchilov and Hutter 2019) optimizer with a weight decay of 0.005 and betas (0.9, 0.98). The learning rate is initially set to 5e-5 and is decayed by a factor of 10 following a cosine annealing schedule. All video frames are resized to 224 × 224, and 8 frames are randomly sampled from each video while the temporal order is preserved. During pre-training, every word in the sentence is randomly masked with 15% probability to enable masked language modeling under both normal and causal attention.
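Since no source code is released, the quoted setup can only be illustrated as a hedged sketch. The snippet below mirrors the stated hyperparameters (AdamW with betas (0.9, 0.98) and weight decay 0.005, a 5e-5 learning rate annealed by cosine decay to roughly one tenth, and 15% random word masking); the tiny nn.Linear model, the dummy loss, and the BERT [MASK] id 103 are illustrative placeholders, not the paper's implementation.

```python
# Sketch of the quoted pre-training hyperparameters (assumptions noted in comments).
import random
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 40            # pre-training epochs (quoted)
MASK_PROB = 0.15       # probability of masking each word token (quoted)

model = nn.Linear(8, 8)  # placeholder for the video-text model (assumption)
optimizer = AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.98), weight_decay=0.005)
# Cosine annealing; the 5e-6 floor reflects the quoted decay "by a factor of 10".
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=5e-6)

def mask_tokens(token_ids, mask_token_id=103, mask_prob=MASK_PROB):
    """Randomly replace word tokens with the [MASK] id at the given probability."""
    return [mask_token_id if random.random() < mask_prob else t for t in token_ids]

# Example: mask a tokenized caption (token ids here are purely illustrative).
print(mask_tokens([2023, 2003, 1037, 2678, 1997, 1037, 3899]))

for epoch in range(EPOCHS):
    # Placeholder step; the real loop runs the MLM and contrastive objectives over
    # WebVid and Conceptual Captions with 8 frames per video resized to 224 x 224.
    loss = model(torch.randn(4, 8)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # step the cosine schedule once per epoch
```

The quoted global batch size of 2048 would in practice be split across the 64 V100 GPUs with a distributed data-parallel wrapper; that machinery is omitted here for brevity.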