Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

Authors: Zechen Bai, Tianjun Xiao, Tong He, Pichao Wang, Zheng Zhang, Thomas Brox, Mike Zheng Shou

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. ... We address the problem... Our method achieves state-of-the-art performance on standard benchmarks such as MSR-VTT (Xu et al., 2016b), MSVD (Wu et al., 2017b), LSMDC (Rohrbach et al., 2015b), and VATEX (Wang et al., 2019), demonstrating its robustness and effectiveness across diverse datasets. ... 4 EXPERIMENTS Setups. We employ MSR-VTT Xu et al. (2016a), MSVD Wu et al. (2017a), LSMDC Rohrbach et al. (2015a), and VATEX Wang et al. (2019) to evaluate our method. We use standard retrieval metrics: recall at rank K (R@K, higher is better), median rank (MdR, lower is better), and mean rank (MnR, lower is better) to evaluate the performance. ... 4.2 ABLATION STUDY
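The retrieval metrics quoted in the excerpt above (R@K, median rank, mean rank) follow the standard text-video retrieval convention; a minimal sketch of how they are computed from a query-by-video similarity matrix (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank (MdR), and mean rank (MnR).

    sim[i, j] is the similarity between query i and video j; the
    ground-truth video for query i is assumed to be video i (the
    usual TVR evaluation convention).
    """
    n = sim.shape[0]
    # Sort videos by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the ground-truth video for each query (1 = retrieved first).
    ranks = np.where(order == np.arange(n)[:, None])[1] + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics

# Toy example: an identity similarity matrix ranks every
# ground-truth video first, so all ranks equal 1.
print(retrieval_metrics(np.eye(4)))
```

Higher R@K and lower MdR/MnR indicate better retrieval, matching the directions noted in the quoted setup.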
Researcher Affiliation | Collaboration | Zechen Bai^1, Tianjun Xiao^2, Tong He^2, Pichao Wang^2, Zheng Zhang^2, Thomas Brox^2,3, Mike Zheng Shou^1; 1 Show Lab, National University of Singapore; 2 Amazon; 3 University of Freiburg
Pseudocode | Yes | The main idea is to iteratively select queries in a way that maximizes the minimum distance between the selected queries. The algorithm starts with an initial query and repeatedly selects the query that is farthest from all the previously selected queries. This process continues until the desired number of queries is selected. ... The formal expression is Eq. 9, where Q is the original query, {Q_1, ..., Q_k} are the enriched queries, and k is the number of selected queries (k ≤ n). The distance is computed in an embedding space. {Q, Q_1, ..., Q_k} = FQS({Q, Q_1^test, ..., Q_n^test}, k)  (9) Figure 3: Illustration of the Farthest Query Sampling (FQS) algorithm. The queries are distributed within a certain range of relevance. The blue point (user query) is set as the root query. At each step, FQS samples the query that is farthest from all previously selected queries.
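The greedy farthest-point selection described in this evidence can be sketched in a few lines. A hedged illustration assuming queries are already embedded as vectors, with Euclidean distance standing in for whatever embedding-space distance the paper uses (function and variable names are mine, not the paper's):

```python
import numpy as np

def farthest_query_sampling(embs, k):
    """Greedy farthest-point selection over query embeddings.

    embs[0] is the user (root) query; embs[1:] are candidate enriched
    queries. Returns indices of the root plus k selected queries, each
    chosen to maximize its minimum distance to all queries selected so far.
    """
    selected = [0]  # start from the root (user) query
    # Distance from every candidate to its nearest selected query.
    min_dist = np.linalg.norm(embs - embs[0], axis=1)
    for _ in range(k):
        nxt = int(np.argmax(min_dist))  # farthest remaining query
        selected.append(nxt)
        # Selected queries get distance 0, so they are never re-picked.
        min_dist = np.minimum(min_dist, np.linalg.norm(embs - embs[nxt], axis=1))
    return selected

# Toy 1-D example: root at 0, candidates at 1, 2, and 10.
embs = np.array([[0.0], [1.0], [2.0], [10.0]])
print(farthest_query_sampling(embs, 2))  # → [0, 3, 2]
```

The root query is picked first (index 0), then the outlier at 10, then the point at 2, which is farthest from both previously selected queries; this mirrors the diversity-maximizing behavior the quoted text describes.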
Open Source Code | No | REPRODUCIBILITY STATEMENT We have made every effort to ensure that our results are fully reproducible. Detailed descriptions of the proposed framework, including the data processing pipeline, model architecture, and training procedures, are provided in the main paper as well as the comprehensive appendix. To facilitate reproducibility, all datasets used in our experiments are publicly available, and the specific processing steps are sufficiently described. We will make the code, data, and pre-trained model public.
Open Datasets | Yes | Setups. We employ MSR-VTT Xu et al. (2016a), MSVD Wu et al. (2017a), LSMDC Rohrbach et al. (2015a), and VATEX Wang et al. (2019) to evaluate our method.
Dataset Splits | Yes | Datasets. MSR-VTT Xu et al. (2016a) contains a total of 10K video clips, each having 20 captions. We utilize the Training-9k subset for training and the test-1K-A subset for evaluation. MSVD Wu et al. (2017a) contains 1,970 videos with 80K captions, with 40 captions on average per video. There are 1,200, 100, and 670 videos in the train, validation, and test sets, respectively. LSMDC Rohrbach et al. (2015a) consists of 118,081 video clips sourced from 202 movies, with one caption corresponding to each clip. Evaluation is conducted on a test set of 1,000 videos from movies disjoint from the train and validation sets. VATEX Wang et al. (2019) collects around 35K videos with multiple text annotations in both English and Chinese for each video. There are around 26K videos for training, 1,500 for validation, and 1,500 for testing.
Hardware Specification | Yes | We train our model using NVIDIA A10 24G GPU.
Software Dependencies | No | For text enrichment in training, we use blip2-opt-2.7b-coco as the captioner. For text enrichment in retrieval, we utilize the GPT-4 model through the API to generate queries.
Experiment Setup | Yes | For the training hyper-parameters, we also follow X-Pool Gorti et al. (2022). The model is trained for 3 epochs on the enriched training set and 5 epochs on the standard training set. A cosine scheduler Loshchilov & Hutter (2016) is employed to decay the learning rate.
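The cosine scheduler cited in the quoted setup (Loshchilov & Hutter, 2016) follows a standard half-period cosine decay; a minimal sketch of that annealing formula, with illustrative parameter names not taken from the paper:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate: decays from base_lr at step 0
    to min_lr at total_steps, following half a cosine period."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0, 100, 1e-4))    # full base_lr at the start of training
print(cosine_lr(100, 100, 1e-4))  # decayed to min_lr at the end
```

In practice this is usually applied via a framework scheduler (e.g. PyTorch's CosineAnnealingLR) rather than by hand; the formula above is only meant to make the decay shape concrete.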