Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Video-Text Pre-training with Learned Regions for Retrieval
Authors: Rui Yan, Mike Zheng Shou, Yixiao Ge, Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our Region Learner. |
| Researcher Affiliation | Collaboration | Nanjing University of Science and Technology, Jiangsu, China; Show Lab, National University of Singapore, Singapore; Tencent PCG, Beijing, China; Columbia University, New York, USA; Tongji University, Shanghai, China |
| Pseudocode | No | The paper describes its approach using descriptive text and mathematical equations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We will release relevant code and pre-trained model weights to facilitate the research community. |
| Open Datasets | Yes | Following the recent work (Bain et al. 2021), we pretrain our model on an affordably large-scale video-text benchmark (WebVid-2M (Bain et al. 2021)) and an image-text benchmark (Google Conceptual Captions (Sharma et al. 2018)). |
| Dataset Splits | No | The paper mentions pretraining on WebVid-2M and CC3M, and fine-tuning/evaluating on MSR-VTT, DiDeMo, LSMDC, and MSVD. While it refers to a '1K-A test set' for MSR-VTT, it does not specify explicit train/validation/test splits (e.g., percentages or counts) for any of these datasets for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a ViT and transformer-based architectures, but it does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with specific versions) required to replicate the experiments. |
| Experiment Setup | Yes | To determine how many regions the model needs to learn, we set the range of K from 2^0 to 2^6, and the results are shown in Figure 4a. We can see that if K is too large, it may be difficult for the model to find discriminative regions because the module tends to reserve the whole feature map. On the contrary, if K is too small many fragmentary and weak semantics will be discarded in large quantities, leading to poor results. Our approach achieves the best results with K = 8 regions. ... We tried to build multiple layers of spatial-temporal dependencies on top of the region features and found that too many layers are not good... Thus, we use only single-layer attention. |
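
The sweep over the number of regions K described under Experiment Setup could be organised roughly as in the sketch below. It assumes the candidate values are the powers of two from 2^0 to 2^6, the reading consistent with the reported best value of K = 8. The names `candidate_region_counts`, `sweep_k`, and `evaluate` are hypothetical and do not come from the paper; `evaluate` stands in for whatever train-and-score routine a replication would use.

```python
from typing import Callable, Iterable, List, Tuple


def candidate_region_counts(min_exp: int = 0, max_exp: int = 6) -> List[int]:
    """Return the powers of two from 2**min_exp to 2**max_exp, inclusive.

    The default exponents are an assumption about the sweep range.
    """
    return [2 ** e for e in range(min_exp, max_exp + 1)]


def sweep_k(
    evaluate: Callable[[int], float],
    candidates: Iterable[int],
) -> Tuple[int, float]:
    """Score every candidate K and return the (K, score) pair with the highest score.

    `evaluate` is a hypothetical callback mapping K to a retrieval metric
    where higher is better. Ties go to the earliest candidate.
    """
    best_k = None
    best_score = float("-inf")
    for k in candidates:
        score = evaluate(k)
        if score > best_score:
            best_k, best_score = k, score
    if best_k is None:
        raise ValueError("candidates must not be empty")
    return best_k, best_score
```

In use, `evaluate` would wrap a full fine-tuning and evaluation run for a given K. Any stand-in scoring function used to exercise the sketch is a toy and does not reflect results from the paper.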