Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DUQ: Dual Uncertainty Quantification for Text-Video Retrieval

Authors: Xin Liu, Shibai Yin, Jun Wang, Jiaxin Zhu, Xingyang Wang, Yee-Hong Yang

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility variables, classification results, and the supporting LLM responses:
Research Type: Experimental. "We conduct extensive experiments on six benchmark datasets: MSRVTT, DiDeMo, LSMDC, MSVD, Charades, and VATEX, achieving state-of-the-art retrieval performance (51.2%, +1.9% in R@1 on MSRVTT)."
Researcher Affiliation: Academia. "1 Southwestern University of Finance and Economics, 2 University of Alberta." All listed institutions are universities, indicating an academic affiliation.
Pseudocode: No. The paper does not contain any structured pseudocode or algorithm blocks; Figure 2 provides a framework diagram, but it is not pseudocode.
Open Source Code: Yes. "Code is available at https://github.com/OPA067/DUQ"
Open Datasets: Yes. "We adopt six benchmark datasets for evaluation: (1) MSRVTT [Xu et al., 2016] consists of 10K videos, each paired with 20 captions. (2) DiDeMo [Anne Hendricks et al., 2017] contains 10,642 video clips and 40,543 captions. (3) LSMDC [Rohrbach et al., 2015] includes 118,081 video clips from 202 movies. (4) MSVD [Liu et al., 2019] includes 1,970 videos and over 80K captions. (5) Charades [Sigurdsson et al., 2016] consists of 9,848 video clips, and we adopt the split protocol from [Lin et al., 2022]. (6) VATEX [Wang et al., 2019] contains 34K video clips."
Dataset Splits: Yes. "We adopt six benchmark datasets for evaluation: (1) MSRVTT [Xu et al., 2016] ... We follow the training and testing splits from [Yu et al., 2018]. (2) DiDeMo [Anne Hendricks et al., 2017] ... We use the training and testing protocols from [Gabeur et al., 2020]. (3) LSMDC [Rohrbach et al., 2015] ... We use the split from [Torabi et al., 2016], with 1,000 videos reserved for testing. (4) MSVD [Liu et al., 2019] ... training, validation, and test sets containing 1,200, 100, and 670 videos, respectively. (5) Charades [Sigurdsson et al., 2016] ... we adopt the split protocol from [Lin et al., 2022]. (6) VATEX [Wang et al., 2019] ... We follow the train-test split from [Chen et al., 2020]."
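For quick reference, the datasets and split protocols quoted above can be collected into a small Python structure. This is a bookkeeping sketch only; the variable and field names are illustrative, not taken from the authors' code, and the sizes are as quoted from the paper.

```python
# Benchmark datasets and split protocols reported in the DUQ paper.
# Names and structure are illustrative; counts are as quoted.
DATASETS = {
    "MSRVTT":   {"videos": 10_000,  "split": "Yu et al., 2018"},
    "DiDeMo":   {"videos": 10_642,  "split": "Gabeur et al., 2020"},
    "LSMDC":    {"videos": 118_081, "split": "Torabi et al., 2016 (1,000 test videos)"},
    "MSVD":     {"videos": 1_970,   "split": "1,200 train / 100 val / 670 test"},
    "Charades": {"videos": 9_848,   "split": "Lin et al., 2022"},
    "VATEX":    {"videos": 34_000,  "split": "Chen et al., 2020"},
}

# Sanity check: the paper evaluates on exactly six benchmarks.
assert len(DATASETS) == 6
```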
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments. It mentions using CLIP as a backbone model and a batch size, but no hardware.
Software Dependencies: No. The paper mentions using CLIP as a backbone model, but does not specify its version or any other software dependencies with version numbers.
Experiment Setup: Yes. "The batch size is set to 32, and the model is trained for 5 epochs across different datasets. We sample an average of F = 12 frames from each video clip, resizing them to 224 × 224 pixels for all datasets. The hyper-parameters are set as α = 1×10⁻¹, β = 1×10⁻⁴, γ₁ = γ₂ = 1×10⁻¹, and the number of probabilistic embeddings is K = 7."
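As a sanity check on the reported setup, the training configuration above can be written out as a single Python dictionary. The dictionary keys (`cfg`, `num_frames`, etc.) are illustrative names chosen here, not identifiers from the authors' released code; the values are those reported in the paper.

```python
# Reported DUQ training configuration (values quoted from the paper's
# experiment setup; key names are illustrative, not the authors' code).
cfg = {
    "batch_size": 32,
    "epochs": 5,
    "num_frames": 12,          # F = 12 frames sampled per video clip
    "frame_size": (224, 224),  # frames resized to 224 x 224 pixels
    "alpha": 1e-1,             # α = 1 × 10⁻¹
    "beta": 1e-4,              # β = 1 × 10⁻⁴
    "gamma1": 1e-1,            # γ₁ = 1 × 10⁻¹
    "gamma2": 1e-1,            # γ₂ = 1 × 10⁻¹
    "num_prob_embeddings": 7,  # K = 7 probabilistic embeddings
}

# Consistency checks against the quoted hyper-parameter values.
assert cfg["gamma1"] == cfg["gamma2"] == cfg["alpha"]
assert cfg["beta"] < cfg["alpha"]
```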