Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

Authors: Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms the existing state-of-the-art methods.
Researcher Affiliation | Academia | Peng Jin (1,2), Hao Li (1,2), Zesen Cheng (1,2), Jinfa Huang (1,2), Zhennan Wang (3), Li Yuan (1,2,3), Chang Liu (4), Jie Chen (1,2,3). (1) School of Electronic and Computer Engineering, Peking University, Shenzhen, China; (2) AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, China; (3) Peng Cheng Laboratory, Shenzhen, China; (4) Department of Automation and BNRist, Tsinghua University, Beijing, China.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Corresponding authors: Chang Liu, Jie Chen. Code is available at https://github.com/jpthu17/DiCoSA.
Open Datasets | Yes | Datasets. MSR-VTT [Xu et al., 2016] contains 10,000 YouTube videos, each with 20 text descriptions. ... LSMDC [Rohrbach et al., 2015] contains 118,081 video clips from 202 movies. ... MSVD [Chen and Dolan, 2011] contains 1,970 videos. ... ActivityNet Captions [Krishna et al., 2017] contains 20K YouTube videos. ... DiDeMo [Anne Hendricks et al., 2017] contains 10k videos annotated with 40k text descriptions.
Dataset Splits | No | The paper specifies training and testing splits for MSR-VTT, MSVD, and ActivityNet, but does not explicitly detail a separate validation split with specific numbers or percentages for all datasets. For example, for MSR-VTT it states "9,000 videos for training and 1,000 for testing" without mentioning a validation set. (See the split sketch after the table.)
Hardware Specification | Yes | We report the average inference time for processing the test set (1k videos and 1k text queries) using two Tesla V100 GPUs. (See the timing sketch after the table.)
Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) as a pre-trained model and the Adam optimizer, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. (See the CLIP-loading sketch after the table.)
Experiment Setup | Yes | If not otherwise specified, we set τ = 0.01, K = 8, α = 0.01, β = 0.005. The network is optimized with a batch size of 128 for 5 epochs. The initial learning rate is 1e-7 for the text and video encoders and 1e-3 for other modules. (See the optimizer sketch after the table.)
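The Dataset Splits row flags a common reproducibility gap: no validation set is reported. Below is a minimal sketch of one way a reproducer might carve a validation set out of the published training split; the 5% fraction, the seed, and the function name are assumptions, not anything the paper specifies.

```python
import random

def make_val_split(train_ids, val_fraction=0.05, seed=42):
    """Hold out a fraction of the training videos for validation.
    The 5% fraction and the seed are arbitrary choices, not from the paper."""
    rng = random.Random(seed)
    ids = list(train_ids)
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]  # (train ids, val ids)

# Applied to MSR-VTT's reported 9,000 training videos:
train_ids, val_ids = make_val_split(range(9000))
print(len(train_ids), len(val_ids))  # 8550 450
```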
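The Hardware Specification row quotes an average inference time over the 1k-video/1k-query test set, but the paper does not describe its timing harness. The following single-GPU sketch shows one common way to take such a measurement; `model` and `batches` are placeholders, and the reported two-V100 parallelism is omitted.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time(model, batches, device="cuda"):
    """Average wall-clock inference time per batch on one GPU.
    `model` and `batches` are placeholders for the retrieval model and
    the test loader; this harness is an assumption, not the paper's."""
    model.eval()
    if torch.cuda.is_available():
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    count = 0
    for batch in batches:
        model(batch.to(device))  # assumes each batch is a tensor
        count += 1
    if torch.cuda.is_available():
        torch.cuda.synchronize(device)  # flush queued GPU kernels before stopping the clock
    return (time.perf_counter() - start) / max(count, 1)
```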
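For the Software Dependencies row: the paper names CLIP (ViT-B/32) as its backbone but pins no versions. A minimal sketch of loading that backbone with OpenAI's reference `clip` package, assuming a recent PyTorch; the caption text is an arbitrary example.

```python
import torch
import clip  # OpenAI's reference package, https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
# clip.load downloads the ViT-B/32 weights and the matching image transform.
model, preprocess = clip.load("ViT-B/32", device=device)

tokens = clip.tokenize(["a man is cooking pasta"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 512]) for ViT-B/32
```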
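The Experiment Setup row translates directly into an optimizer configuration: a low learning rate for the pre-trained encoders and a higher one for the newly added modules. A minimal PyTorch sketch of that two-group Adam setup follows; the dummy model, its module layout, and its attribute names are assumptions, while the hyperparameter values are the ones quoted above.

```python
import torch
import torch.nn as nn

class DummyModel(nn.Module):
    """Stand-in for the retrieval model: `clip` represents the pre-trained
    text/video encoders, `head` the newly added modules. The attribute
    names are hypothetical; only the hyperparameter values are quoted."""
    def __init__(self):
        super().__init__()
        self.clip = nn.Linear(512, 512)
        self.head = nn.Linear(512, 512)

TAU, K, ALPHA, BETA = 0.01, 8, 0.01, 0.005   # quoted from the paper
BATCH_SIZE, EPOCHS = 128, 5                  # quoted from the paper

model = DummyModel()
clip_params = [p for n, p in model.named_parameters() if n.startswith("clip.")]
new_params = [p for n, p in model.named_parameters() if not n.startswith("clip.")]

optimizer = torch.optim.Adam([
    {"params": clip_params, "lr": 1e-7},  # pre-trained text/video encoders
    {"params": new_params, "lr": 1e-3},   # other (newly initialized) modules
])
```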