Video-Text Pre-training with Learned Regions for Retrieval
Authors: Rui Yan, Mike Zheng Shou, Yixiao Ge, Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our Region Learner. |
| Researcher Affiliation | Collaboration | Nanjing University of Science and Technology, Jiangsu, China; Show Lab, National University of Singapore, Singapore; Tencent PCG, Beijing, China; Columbia University, New York, USA; Tongji University, Shanghai, China |
| Pseudocode | No | The paper describes its approach using descriptive text and mathematical equations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We will release relevant code and pre-trained model weights to facilitate the research community. |
| Open Datasets | Yes | Following the recent work (Bain et al. 2021), we pretrain our model on an affordably large-scale video-text benchmark (WebVid-2M (Bain et al. 2021)) and an image-text benchmark (Google Conceptual Captions (Sharma et al. 2018)). |
| Dataset Splits | No | The paper mentions pretraining on WebVid-2M and CC3M, and fine-tuning/evaluating on MSR-VTT, DiDeMo, LSMDC, and MSVD. While it refers to a '1K-A test set' for MSR-VTT, it does not specify explicit train/validation/test splits (e.g., percentages or counts) for any of these datasets for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a ViT backbone and transformer-based architectures, but it does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with specific versions) required to replicate the experiments. |
| Experiment Setup | Yes | To determine how many regions the model needs to learn, we set the range of K from 2^0 to 2^6, and the results are shown in Figure 4a. We can see that if K is too large, it may be difficult for the model to find discriminative regions because the module tends to reserve the whole feature map. On the contrary, if K is too small, many fragmentary and weak semantics will be discarded in large quantities, leading to poor results. Our approach achieves the best results with K = 8 regions. ... We tried to build multiple layers of spatial-temporal dependencies on top of the region features and found that too many layers are not good... Thus, we use only single-layer attention. (A minimal code sketch of this setup follows the table.) |
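
The quoted setup (K learned regions with a single layer of spatial-temporal attention on top) can be illustrated with a short, hypothetical PyTorch sketch. All names here (RegionLearnerSketch, num_regions, the soft-assignment pooling) are assumptions for illustration rather than the authors' released implementation; only K = 8 and the single attention layer are taken from the quoted ablation.

```python
# Hypothetical sketch of a "Region Learner"-style module: K learnable region
# embeddings pool ViT patch features, followed by ONE spatial-temporal
# attention layer (the paper reports that stacking more layers did not help).
import torch
import torch.nn as nn


class RegionLearnerSketch(nn.Module):
    def __init__(self, dim: int = 768, num_regions: int = 8, num_heads: int = 8):
        super().__init__()
        # K = 8 is the best value in the paper's ablation (K swept from 2^0 to 2^6).
        self.codebook = nn.Parameter(torch.randn(num_regions, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, T, N, D) patch tokens from a ViT backbone.
        B, T, N, D = patch_feats.shape
        K = self.codebook.shape[0]
        x = patch_feats.reshape(B * T, N, D)
        # Soft-assign patches to region embeddings (a stand-in for the paper's
        # quantization step) and pool them into K region features per frame.
        assign = (x @ self.codebook.t()).softmax(dim=1)   # (B*T, N, K)
        regions = assign.transpose(1, 2) @ x              # (B*T, K, D)
        # Single layer of attention over all regions of all frames in a clip.
        regions = regions.reshape(B, T * K, D)
        attn_out, _ = self.attn(regions, regions, regions)
        return self.norm(regions + attn_out)              # (B, T*K, D)


if __name__ == "__main__":
    feats = torch.randn(2, 4, 196, 768)        # 2 clips, 4 frames, 14x14 patches
    print(RegionLearnerSketch()(feats).shape)  # torch.Size([2, 32, 768])
```

Sweeping `num_regions` over 1, 2, 4, ..., 64 (i.e., 2^0 to 2^6) and comparing downstream retrieval metrics would mirror the ablation quoted in the Experiment Setup row.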