Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DUQ: Dual Uncertainty Quantification for Text-Video Retrieval
Authors: Xin Liu, Shibai Yin, Jun Wang, Jiaxin Zhu, Xingyang Wang, Yee-Hong Yang
IJCAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on six benchmark datasets: MSRVTT, DiDeMo, LSMDC, MSVD, Charades, and VATEX, achieving state-of-the-art retrieval performance (51.2%, +1.9% in R@1 on MSRVTT). |
| Researcher Affiliation | Academia | 1 Southwestern University of Finance and Economics, 2 University of Alberta. All listed institutions are universities, indicating an academic affiliation. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Figure 2 provides a framework diagram, but it is not pseudocode. |
| Open Source Code | Yes | *Corresponding author. Code is available at https://github.com/OPA067/DUQ |
| Open Datasets | Yes | We adopt six benchmark datasets for evaluation: (1) MSRVTT [Xu et al., 2016] consists of 10K videos, each paired with 20 captions. (2) DiDeMo [Anne Hendricks et al., 2017] contains 10,642 video clips and 40,543 captions. (3) LSMDC [Rohrbach et al., 2015] includes 118,081 video clips from 202 movies. (4) MSVD [Liu et al., 2019] includes 1,970 videos and over 80K captions. (5) Charades [Sigurdsson et al., 2016] consists of 9,848 video clips, and we adopt the split protocol from [Lin et al., 2022]. (6) VATEX [Wang et al., 2019] contains 34K video clips. |
| Dataset Splits | Yes | We adopt six benchmark datasets for evaluation: (1) MSRVTT [Xu et al., 2016] ... We follow the training and testing splits from [Yu et al., 2018]. (2) DiDeMo [Anne Hendricks et al., 2017] ... We use the training and testing protocols from [Gabeur et al., 2020]. (3) LSMDC [Rohrbach et al., 2015] ... We use the split from [Torabi et al., 2016], with 1,000 videos reserved for testing. (4) MSVD [Liu et al., 2019] ... training, validation, and test sets containing 1,200, 100, and 670 videos, respectively. (5) Charades [Sigurdsson et al., 2016] ... we adopt the split protocol from [Lin et al., 2022]. (6) VATEX [Wang et al., 2019] ... We follow the train-test split from [Chen et al., 2020]. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It mentions using CLIP as a backbone model and batch size but no hardware. |
| Software Dependencies | No | The paper mentions using CLIP as a backbone model, which is a software-based framework, but does not specify its version or any other software dependencies with their respective version numbers. |
| Experiment Setup | Yes | The batch size is set to 32, and the model is trained for 5 epochs across different datasets. We sample an average of F = 12 frames from each video clip, resizing them to 224×224 pixels for all datasets. The hyper-parameters are set as α = 1×10⁻¹, β = 1×10⁻⁴, γ₁ = γ₂ = 1×10⁻¹, and the number of probabilistic embeddings is K = 7. |
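For reference, the hyper-parameters quoted in the experiment-setup row above can be collected into a single configuration sketch. This is a hedged illustration only: the key names below are assumptions for readability, not identifiers from the authors' released code at https://github.com/OPA067/DUQ.

```python
# Hypothetical configuration sketch of the reported DUQ training setup.
# Values come from the paper's experiment-setup description; key names
# are illustrative assumptions, not the authors' actual variable names.
config = {
    "batch_size": 32,
    "epochs": 5,
    "num_frames": 12,            # F = 12 frames sampled per video clip
    "frame_size": (224, 224),    # input resolution in pixels
    "alpha": 1e-1,               # α loss weight
    "beta": 1e-4,                # β loss weight
    "gamma1": 1e-1,              # γ₁ loss weight
    "gamma2": 1e-1,              # γ₂ loss weight
    "num_prob_embeddings": 7,    # K = 7 probabilistic embeddings
}
```

Anyone attempting reproduction should verify these values against the released repository, since the paper omits hardware and software-version details.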