Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Authors: Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments and evaluate LF-VILA on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering. We surpass the state-of-the-art models pre-trained on short videos by a large margin. Our results demonstrate the benefit of modeling long-range dependency for long-form videos. We also verify the effectiveness of our proposed MTC loss and HTWA mechanism through ablation studies.
Researcher Affiliation | Collaboration | Yuchong Sun (1), Hongwei Xue (2), Ruihua Song (1), Bei Liu (3), Huan Yang (3), Jianlong Fu (3); (1) Renmin University of China, Beijing, China; (2) University of Science and Technology of China, Hefei, China; (3) Microsoft Research, Beijing, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; it describes the methods in regular paragraph text.
Open Source Code | Yes | We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
Open Datasets | Yes | To facilitate research on long-form video understanding, we build a large-scale long-form video-paragraph dataset based on HD-VILA-100M [53], which is an existing large-scale video-language dataset with diverse categories.
Dataset Splits | No | The paper reports training epochs and batch sizes but does not explicitly give train/validation/test split counts or percentages in the main text, either for its own constructed dataset or for the downstream tasks; some details are deferred to the supplementary material.
Hardware Specification | Yes | We train our model with 32 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper names the models and optimizer used (e.g., Swin-Transformer, BERT, the AdamW optimizer) but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | During pre-training, our model samples 4 consecutive clip-sentence pairs as input. We uniformly sample 8 frames from each clip and resize the frames to 192 × 320. We use the WordPiece tokenizer, as in BERT, to split each sentence into tokens with a max length of 50. For the video encoder, we use Swin-Transformer [32] as the backbone and integrate our proposed HTWA for the frame sequence. Temporal window sizes for the five stages are set to 2, 4, 8, 16, and 32, respectively. We use 8 × 8 patches and a fixed spatial window of 3 × 5; the output feature is down-sampled by a factor of 64 to 3 × 5. We adopt a 12-layer Transformer network for the text encoder, with 8 layers for the first part and 4 layers for the second part. We also use a 12-layer Transformer network for the cross-modal encoder. The weights of the video encoder are initialized with Swin-Transformer pre-trained on ImageNet-21K. We use the first 12 layers of BERT-Large to initialize the weights of the text encoder, and the last 12 layers to initialize the weights of the cross-modal encoder. We use the AdamW optimizer with a learning rate of 5e-5, warm up the learning rate for 1 epoch followed by a linear decay, and use a weight decay of 0.05. For stage one, we use a batch size of 512 and train for 6 epochs. For stage two, we use a batch size of 1,536 and train for another 6 epochs.
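
The Open Datasets and Experiment Setup rows above describe sampling 4 consecutive clip-sentence pairs per long-form video and 8 uniformly spaced frames per clip. The following is a minimal sketch of that sampling logic, assuming a hypothetical annotation schema with per-clip start/end times and sentence text (the released dataset's actual format may differ); frame decoding and resizing to 192 × 320 are left out.

```python
# Hypothetical sketch: sample 4 consecutive clip-sentence pairs and 8 frame timestamps
# per clip. The annotation fields ("clips", "start", "end", "text") are assumptions,
# not the released dataset's confirmed schema.
import random
from typing import Dict, List

NUM_PAIRS = 4            # consecutive clip-sentence pairs per training example
FRAMES_PER_CLIP = 8      # uniformly sampled frames per clip
FRAME_SIZE = (192, 320)  # (height, width) after resizing, applied elsewhere

def sample_pairs(video_ann: Dict) -> List[Dict]:
    """Pick NUM_PAIRS consecutive clip-sentence pairs from one long-form video."""
    clips = video_ann["clips"]  # assumed: list of {"start", "end", "text"} dicts
    if len(clips) < NUM_PAIRS:
        return []
    start = random.randint(0, len(clips) - NUM_PAIRS)
    return clips[start:start + NUM_PAIRS]

def sample_frame_times(clip: Dict) -> List[float]:
    """Spread FRAMES_PER_CLIP timestamps uniformly over the clip's duration."""
    duration = clip["end"] - clip["start"]
    step = duration / FRAMES_PER_CLIP
    return [clip["start"] + (i + 0.5) * step for i in range(FRAMES_PER_CLIP)]
```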
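
For the optimization details in the Experiment Setup row (AdamW, learning rate 5e-5, 1-epoch warmup with linear decay, weight decay 0.05) and the 32-GPU hardware noted above, a hedged PyTorch-style sketch follows. The DistributedDataParallel wiring and the way the global batch sizes (512 and 1,536) would be split across GPUs are assumptions about a typical setup, not details stated in the paper.

```python
# Hedged sketch only: optimizer, warmup + linear-decay schedule, and DDP wrapping for
# multi-GPU training (the paper reports 32 V100s). The distributed setup below assumes
# torchrun-style environment variables and is not taken from the released code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, steps_per_epoch: int, num_epochs: int = 6):
    """AdamW with lr 5e-5 and weight decay 0.05; warm up for 1 epoch, then decay linearly."""
    optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)
    warmup_steps = steps_per_epoch
    total_steps = steps_per_epoch * num_epochs

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:  # linear warmup over the first epoch
            return step / max(1, warmup_steps)
        # linear decay to zero over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)

def wrap_distributed(model):
    """Wrap the model for data-parallel training; global batch = per-GPU batch x world size."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```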
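
The MTC loss referenced in the ablation studies above is a cross-modal contrastive objective over clip and sentence embeddings. The sketch below shows only a generic symmetric InfoNCE loss between aligned clip-sentence pairs, to illustrate the general form of such an objective; it is not the paper's exact MTC formulation, which additionally exploits temporal relationships among clips and sentences and should be taken from the paper itself.

```python
# Illustrative sketch only: a symmetric InfoNCE-style loss between the clip and sentence
# embeddings of one long-form video. This is a generic contrastive objective, not the
# paper's exact Multimodal Temporal Contrastive (MTC) loss.
import torch
import torch.nn.functional as F

def clip_sentence_contrastive(clip_emb: torch.Tensor,
                              sent_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """clip_emb, sent_emb: (num_pairs, dim) embeddings of aligned clip-sentence pairs."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = clip_emb @ sent_emb.t() / temperature       # (num_pairs, num_pairs) similarities
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)          # clip -> matching sentence
    loss_t2v = F.cross_entropy(logits.t(), targets)      # sentence -> matching clip
    return 0.5 * (loss_v2t + loss_t2v)
```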
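
Since the Software Dependencies row notes that no library versions are reported, reproducers will need to pin their own environment. The snippet below is a small, hypothetical helper for recording installed versions of the likely dependencies; the package names torch, timm, and transformers are common choices implied by the components the paper names (PyTorch training, Swin-Transformer, BERT), not versions or packages confirmed by the paper.

```python
# Hypothetical helper: print installed versions of assumed dependencies so a
# reproduction environment can be pinned. Package names are assumptions.
from importlib import metadata

def report_versions(packages=("torch", "timm", "transformers")):
    for name in packages:
        try:
            print(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")

if __name__ == "__main__":
    report_versions()
```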