Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Authors: Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms the existing state-of-the-art methods. |
| Researcher Affiliation | Academia | Peng Jin1,2, Hao Li1,2, Zesen Cheng1,2, Jinfa Huang1,2, Zhennan Wang3, Li Yuan1,2,3, Chang Liu4, Jie Chen1,2,3. 1School of Electronic and Computer Engineering, Peking University, Shenzhen, China; 2AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, China; 3Peng Cheng Laboratory, Shenzhen, China; 4Department of Automation and BNRist, Tsinghua University, Beijing, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Corresponding author: Chang Liu, Jie Chen. Code is available at https://github.com/jpthu17/DiCoSA. |
| Open Datasets | Yes | Datasets. MSR-VTT [Xu et al., 2016] contains 10,000 YouTube videos, each with 20 text descriptions. ... LSMDC [Rohrbach et al., 2015] contains 118,081 video clips from 202 movies. ... MSVD [Chen and Dolan, 2011] contains 1,970 videos. ... ActivityNet Caption [Krishna et al., 2017] contains 20K YouTube videos. ... DiDeMo [Anne Hendricks et al., 2017] contains 10k videos annotated with 40k text descriptions. |
| Dataset Splits | No | The paper specifies training and testing splits for MSR-VTT, MSVD, and ActivityNet, but does not explicitly describe a separate validation split (with counts or percentages) for all datasets. For example, for MSR-VTT it states '9,000 videos for training and 1,000 for testing' without mentioning a validation set. |
| Hardware Specification | Yes | We report the average inference time for processing the test set (1k videos and 1k text queries) using two Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) as a pre-trained model and the Adam optimizer, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | If not otherwise specified, we set τ = 0.01, K = 8, α = 0.01, β = 0.005. The network is optimized with a batch size of 128 for 5 epochs. The initial learning rate is 1e-7 for the text encoder and video encoder and 1e-3 for other modules. (A hedged optimizer sketch for this setup follows the table.) |
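
The per-module learning rates quoted in the experiment-setup row map naturally onto PyTorch optimizer parameter groups. Below is a minimal sketch, not the authors' code: the `clip` and `head` submodule names, the placeholder layers, and the plain Adam defaults are illustrative assumptions; only the two learning rates, the batch size, and the epoch count come from the paper.

```python
import torch

class RetrievalModel(torch.nn.Module):
    """Hypothetical wrapper: `clip` stands in for the pre-trained CLIP
    (ViT-B/32) text/video encoders, `head` for the remaining modules."""
    def __init__(self):
        super().__init__()
        self.clip = torch.nn.Linear(512, 512)  # placeholder for the CLIP backbone
        self.head = torch.nn.Linear(512, 512)  # placeholder for the other modules

model = RetrievalModel()

# Two parameter groups mirroring the reported setup:
# 1e-7 for the pre-trained text/video encoders, 1e-3 for everything else.
optimizer = torch.optim.Adam(
    [
        {"params": model.clip.parameters(), "lr": 1e-7},
        {"params": model.head.parameters(), "lr": 1e-3},
    ]
)

# The paper reports training with a batch size of 128 for 5 epochs;
# those values would apply to the DataLoader and the outer training loop.
BATCH_SIZE = 128
NUM_EPOCHS = 5
```

Keeping the pre-trained CLIP encoders at a much smaller learning rate than the newly initialized modules is a common recipe when fine-tuning CLIP-based retrieval models, since it preserves the pre-trained representations while letting the new heads adapt quickly.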