Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Authors: Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments and evaluate LF-VILA on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering. We surpass the state-of-the-art models pre-trained on short videos by a large margin. Our results demonstrate the benefit of modeling long-range dependency for long-form videos. We also verify the effectiveness of our proposed MTC loss and HTWA mechanism through ablation studies. |
| Researcher Affiliation | Collaboration | Yuchong Sun1, Hongwei Xue2, Ruihua Song1, Bei Liu3, Huan Yang3, Jianlong Fu3 (1Renmin University of China, Beijing, China; 2University of Science and Technology of China, Hefei, China; 3Microsoft Research, Beijing, China) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the methods in regular paragraph text. |
| Open Source Code | Yes | We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain. |
| Open Datasets | Yes | To facilitate research on long-form video understanding, we build a large-scale long-form video-paragraph dataset based on HD-VILA-100M [53], which is an existing large-scale video-language dataset with diverse categories. |
| Dataset Splits | No | The paper describes training epochs and batch sizes but does not explicitly provide percentages or counts for training, validation, and test dataset splits for its own constructed dataset or for the downstream tasks in the main text, deferring some details to supplementary material. |
| Hardware Specification | Yes | We train our model with 32 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions the models and optimizers used (e.g., Swin-Transformer, BERT, the AdamW optimizer) but does not provide version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | During pre-training, our model samples 4 consecutive clip-sentence pairs as input. We uniformly sample 8 frames from each clip and resize the frames to 192 × 320. We use the WordPiece tokenizer, as in BERT, to split each sentence into tokens with a max length of 50. For the video encoder, we use Swin-Transformer [32] as the backbone and integrate our proposed HTWA for the frame sequence. Temporal window sizes are set to 2, 4, 8, 16, and 32 for the five stages, respectively. We use 8 × 8 patches and a fixed spatial window of 3 × 5; the output feature is down-sampled by 64 times to 3 × 5. We adopt a 12-layer Transformer network for the text encoder, with 8 layers for the first part and 4 layers for the second part. We also use a 12-layer Transformer network for the cross-modal encoder. The weight of the video encoder is initialized with Swin-Transformer pre-trained on ImageNet-21K. We use the first 12 layers of BERT-Large to initialize the weight of the text encoder, and the last 12 layers to initialize the weight of the cross-modal encoder. We use an AdamW optimizer with a learning rate of 5e-5, warm up the learning rate for 1 epoch, and then apply a linear decay; we use a weight decay of 0.05. For stage one, we use a batch size of 512 and train for 6 epochs. For stage two, we use a batch size of 1,536 and train for another 6 epochs. (Illustrative sketches of this setup follow the table.) |
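
To make the input pipeline in the Experiment Setup row concrete, here is a minimal sketch of the frame sampling and tokenization it describes: 4 consecutive clip-sentence pairs, 8 uniformly sampled frames per clip resized to 192 × 320, and sentences tokenized to at most 50 tokens. This is not the authors' released code; the dummy clip tensors, the example sentence, and the use of the HuggingFace `BertTokenizer` as a stand-in for the paper's WordPiece tokenizer are assumptions for illustration.

```python
# Hedged sketch of the LF-VILA input preparation described in the paper (not the authors' code).
# Assumes decoded clip frames are already available as tensors; BertTokenizer stands in
# for the WordPiece tokenizer the paper mentions.
import torch
import torch.nn.functional as F
from transformers import BertTokenizer

NUM_CLIPS = 4           # 4 consecutive clip-sentence pairs per sample
FRAMES_PER_CLIP = 8     # 8 uniformly sampled frames per clip
FRAME_SIZE = (192, 320) # target frame resolution
MAX_TOKENS = 50         # max tokens per sentence

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def sample_frames(clip: torch.Tensor) -> torch.Tensor:
    """Uniformly sample 8 frames from a clip of shape (T, C, H, W) and resize to 192x320."""
    t = clip.shape[0]
    idx = torch.linspace(0, t - 1, steps=FRAMES_PER_CLIP).long()
    frames = clip[idx]  # (8, C, H, W)
    return F.interpolate(frames, size=FRAME_SIZE, mode="bilinear", align_corners=False)

def tokenize(sentence: str) -> dict:
    """WordPiece-tokenize a sentence, padded/truncated to 50 tokens."""
    return tokenizer(sentence, max_length=MAX_TOKENS, padding="max_length",
                     truncation=True, return_tensors="pt")

# Toy example: 4 clips of 32 raw frames each, paired with 4 placeholder sentences.
clips = [torch.rand(32, 3, 360, 640) for _ in range(NUM_CLIPS)]
sentences = ["a person explains the next step of the task"] * NUM_CLIPS
video_input = torch.stack([sample_frames(c) for c in clips])  # (4, 8, 3, 192, 320)
text_input = [tokenize(s) for s in sentences]
```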
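
Likewise, a minimal sketch of the optimization schedule quoted above: AdamW with a learning rate of 5e-5 and weight decay of 0.05, a 1-epoch warm-up, then linear decay. The stand-in model, the steps-per-epoch value, and the choice of `LambdaLR` to express the schedule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the pre-training optimization setup (not the authors' code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)     # placeholder standing in for LF-VILA
steps_per_epoch = 1000                # placeholder; depends on dataset size and batch size
num_epochs = 6                        # 6 epochs per pre-training stage
warmup_steps = 1 * steps_per_epoch    # warm up the learning rate for 1 epoch
total_steps = num_epochs * steps_per_epoch

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)

def lr_lambda(step: int) -> float:
    """Linear warm-up over the first epoch, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop (per step):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```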