Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach
Authors: Zechen Bai, Tianjun Xiao, Tong He, Pichao Wang, Zheng Zhang, Thomas Brox, Mike Zheng Shou
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. ... We address the problem... Our method achieves state-of-the-art performance on standard benchmarks such as MSR-VTT (Xu et al., 2016b), MSVD (Wu et al., 2017b), LSMDC (Rohrbach et al., 2015b), and VATEX (Wang et al., 2019), demonstrating its robustness and effectiveness across diverse datasets. ... 4 EXPERIMENTS Setups. We employ MSR-VTT Xu et al. (2016a), MSVD Wu et al. (2017a), LSMDC Rohrbach et al. (2015a), and VATEX Wang et al. (2019) to evaluate our method. We use standard retrieval metrics: recall at rank K (R@K, higher is better), median rank (Md R, lower is better), and mean rank (Mn R, lower is better) to evaluate the performance. ... 4.2 ABLATION STUDY |
| Researcher Affiliation | Collaboration | Zechen Bai1, Tianjun Xiao2, Tong He2, Pichao Wang2, Zheng Zhang2, Thomas Brox2,3, Mike Zheng Shou1 1Show Lab, National University of Singapore 2Amazon 3University of Freiburg |
| Pseudocode | Yes | The main idea is to iteratively select queries in a way that maximizes the minimum distance between the selected queries. The algorithm starts with an initial query and repeatedly selects the query that is farthest from all the previously selected queries. This process continues until the desired number of queries is selected. ... The formal expression is Eq. 9, where Q is the original query, {Q_1, ..., Q_k} are the enriched queries, and k is the number of queries selected (k ≤ n). The distance is computed in an embedding space. {Q, Q_1, ..., Q_k} = FQS({Q, Q_1^test, ..., Q_n^test}, k) (Eq. 9). Figure 3: Illustration of the Farthest Query Sampling (FQS) algorithm. The queries are distributed within a certain range of relevance. The blue point (user query) is set as the root query. At each step, FQS samples the query that is farthest from all previous (n-1) queries. |
| Open Source Code | No | REPRODUCIBILITY STATEMENT We have made every effort to ensure that our results are fully reproducible. Detailed descriptions of the proposed framework, including the data processing pipeline, model architecture, and training procedures, are provided in the main paper as well as the comprehensive appendix. To facilitate reproducibility, all datasets used in our experiments are publicly available datasets, and the specific processing steps are sufficiently described. We will make the code, data, and pre-trained model public. |
| Open Datasets | Yes | Setups. We employ MSR-VTT Xu et al. (2016a), MSVD Wu et al. (2017a), LSMDC Rohrbach et al. (2015a), and VATEX Wang et al. (2019) to evaluate our method. |
| Dataset Splits | Yes | Datasets. MSR-VTT Xu et al. (2016a) contains a total of 10K video clips, each having 20 captions. We utilize the Training-9k subset for training and the test-1K-A subset for evaluation. MSVD Wu et al. (2017a) contains 1,970 videos with 80K captions, with 40 captions on average per video. There are 1200, 100, and 670 videos in the train, validation, and test sets, respectively. LSMDC Rohrbach et al. (2015a) consists of 118,081 video clips sourced from 202 movies with one caption corresponding to each clip. Evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and validation sets. VATEX Wang et al. (2019) collects around 35K videos with multiple text annotations in both English and Chinese for each video. There are around 26K videos for training, 1,500 for validation, and 1,500 for testing. |
| Hardware Specification | Yes | We train our model using NVIDIA A10 24G GPU. |
| Software Dependencies | No | For text enrichment in training, we use blip2-opt-2.7b-coco as the captioner. For text enrichment in retrieval, we utilize the GPT-4 model through the API to generate queries. |
| Experiment Setup | Yes | For the training hyper-parameters, we also follow X-Pool Gorti et al. (2022). The model is trained for 3 epochs on the enriched training set and 5 epochs on the standard training set. A cosine scheduler Loshchilov & Hutter (2016) is employed to decay the learning rate. |
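The greedy farthest-point selection described in the Pseudocode row above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of Euclidean distance over embedding vectors, and the choice of index 0 as the root (user) query are assumptions for the sketch.

```python
import numpy as np

def farthest_query_sampling(embeddings: np.ndarray, k: int, root: int = 0) -> list:
    """Greedy farthest-point selection over query embeddings.

    Starts from a root query and repeatedly selects the query whose
    minimum distance to all previously selected queries is largest,
    until k additional queries are chosen.
    """
    selected = [root]
    # Minimum distance from each candidate to the current selected set.
    min_dist = np.linalg.norm(embeddings - embeddings[root], axis=1)
    for _ in range(k):
        nxt = int(np.argmax(min_dist))  # farthest from all selected so far
        selected.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected
```

For example, with embeddings at [0,0], [10,0], [5,0], [0,1] and k=2, the sketch first picks the point farthest from the root ([10,0]), then the point farthest from both selected points ([5,0]).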