Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Authors: Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms the existing state-of-the-art methods. |
| Researcher Affiliation | Academia | Peng Jin1,2, Hao Li1,2, Zesen Cheng1,2, Jinfa Huang1,2, Zhennan Wang3, Li Yuan1,2,3, Chang Liu4, Jie Chen1,2,3. 1School of Electronic and Computer Engineering, Peking University, Shenzhen, China; 2AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, China; 3Peng Cheng Laboratory, Shenzhen, China; 4Department of Automation and BNRist, Tsinghua University, Beijing, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Corresponding author: Chang Liu, Jie Chen. Code is available at https://github.com/jpthu17/DiCoSA. |
| Open Datasets | Yes | Datasets. MSR-VTT [Xu et al., 2016] contains 10,000 YouTube videos, each with 20 text descriptions. ... LSMDC [Rohrbach et al., 2015] contains 118,081 video clips from 202 movies. ... MSVD [Chen and Dolan, 2011] contains 1,970 videos. ... ActivityNet Caption [Krishna et al., 2017] contains 20K YouTube videos. ... DiDeMo [Anne Hendricks et al., 2017] contains 10k videos annotated with 40k text descriptions. |
| Dataset Splits | No | The paper specifies training and testing splits for MSR-VTT, MSVD, and ActivityNet, but does not explicitly describe a separate validation split (with counts or percentages) for all datasets. For example, for MSR-VTT it states '9,000 videos for training and 1,000 for testing' without mentioning a validation set. |
| Hardware Specification | Yes | We report the average inference time for processing the test set (1k videos and 1k text queries) using two Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) as a pre-trained model and the Adam optimizer, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | If not otherwise specified, we set τ = 0.01, K = 8, α = 0.01, β = 0.005. The network is optimized with a batch size of 128 for 5 epochs. The initial learning rate is 1e-7 for the text encoder and video encoder and 1e-3 for other modules. (A hedged optimizer sketch for this setup follows the table.) |
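
The per-module learning rates quoted in the experiment-setup row map naturally onto PyTorch optimizer parameter groups. Below is a minimal sketch, not the authors' code: the `clip` and `head` submodule names, the placeholder layers, and the plain Adam defaults are illustrative assumptions; only the two learning rates, the batch size, and the epoch count come from the paper.

```python
import torch

class RetrievalModel(torch.nn.Module):
    """Hypothetical wrapper: `clip` stands in for the pre-trained CLIP
    (ViT-B/32) text/video encoders, `head` for the remaining modules."""
    def __init__(self):
        super().__init__()
        self.clip = torch.nn.Linear(512, 512)  # placeholder for the CLIP backbone
        self.head = torch.nn.Linear(512, 512)  # placeholder for the other modules

model = RetrievalModel()

# Two parameter groups mirroring the reported setup:
# 1e-7 for the pre-trained text/video encoders, 1e-3 for everything else.
optimizer = torch.optim.Adam(
    [
        {"params": model.clip.parameters(), "lr": 1e-7},
        {"params": model.head.parameters(), "lr": 1e-3},
    ]
)

# The paper reports training with a batch size of 128 for 5 epochs;
# those values would apply to the DataLoader and the outer training loop.
BATCH_SIZE = 128
NUM_EPOCHS = 5
```

Keeping the pre-trained CLIP encoders at a much smaller learning rate than the newly initialized modules is a common recipe when fine-tuning CLIP-based retrieval models, since it preserves the pre-trained representations while letting the new heads adapt quickly.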