Video-Text Pre-training with Learned Regions for Retrieval

Authors: Rui Yan, Mike Zheng Shou, Yixiao Ge, Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our Region Learner.
Researcher Affiliation | Collaboration | Nanjing University of Science and Technology, Jiangsu, China; Show Lab, National University of Singapore, Singapore; Tencent PCG, Beijing, China; Columbia University, New York, USA; Tongji University, Shanghai, China
Pseudocode | No | The paper describes its approach using descriptive text and mathematical equations, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | We will release relevant code and pre-trained model weights to facilitate the research community.
Open Datasets | Yes | Following the recent work (Bain et al. 2021), we pretrain our model on an affordably large-scale video-text benchmark (WebVid-2M (Bain et al. 2021)) and an image-text benchmark (Google Conceptual Captions (Sharma et al. 2018)).
Dataset Splits | No | The paper mentions pretraining on WebVid-2M and CC3M, and fine-tuning/evaluating on MSR-VTT, DiDeMo, LSMDC, and MSVD. While it refers to a '1K-A test set' for MSR-VTT, it does not specify explicit train/validation/test splits (e.g., percentages or counts) for any of these datasets for reproducibility.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using ViT and transformer-based architectures, but it does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with specific versions) required to replicate the experiments.
Experiment Setup | Yes | To determine how many regions the model needs to learn, we set the range of K from 2^0 to 2^6, and the results are shown in Figure 4a. We can see that if K is too large, it may be difficult for the model to find discriminative regions because the module tends to reserve the whole feature map. On the contrary, if K is too small, many fragmentary and weak semantics will be discarded in large quantities, leading to poor results. Our approach achieves the best results with K = 8 regions. ... We tried to build multiple layers of spatial-temporal dependencies on top of the region features and found that too many layers are not good... Thus, we use only single-layer attention.
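Since the paper releases no pseudocode or reference implementation, the sketch below illustrates one plausible reading of the reported setup: K = 8 learnable region prototypes that softly aggregate video patch features, followed by a single spatial-temporal attention layer over the region tokens. The module name, the soft-assignment scheme, and all dimensions are assumptions for illustration; this is not the authors' Region Learner implementation.

import torch
import torch.nn as nn

class RegionModuleSketch(nn.Module):
    """Hypothetical sketch of the reported setup: aggregate patch features into
    K learned regions (K = 8 per the ablation), then apply a single
    spatial-temporal attention layer (the paper reports more layers hurt)."""

    def __init__(self, dim: int = 768, num_regions: int = 8, num_heads: int = 8):
        super().__init__()
        # K learnable region prototypes used to softly assign patches to regions
        # (assumed mechanism; the paper's own region module may differ)
        self.prototypes = nn.Parameter(torch.randn(num_regions, dim) * 0.02)
        # single-layer attention over region tokens
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_frames * num_patches, dim) video patch features
        # soft assignment of each patch to the K region prototypes
        assign = torch.softmax(patch_feats @ self.prototypes.t(), dim=-1)  # (B, N, K)
        # assignment-weighted pooling -> K region features per clip
        regions = assign.transpose(1, 2) @ patch_feats                     # (B, K, dim)
        regions = regions / assign.sum(dim=1).clamp(min=1e-6).unsqueeze(-1)
        # single layer of spatial-temporal dependency modelling over regions
        return self.attn(regions)

A quick shape check: feeding a tensor of shape (2, 8 * 196, 768) returns (2, 8, 768), i.e. eight region tokens per clip, matching the K = 8 setting the ablation identifies as best.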