Visual Consensus Modeling for Video-Text Retrieval

Authors: Shuqiang Cao, Bairui Wang, Wei Zhang, Lin Ma

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on the public benchmark datasets demonstrate that our proposed method, with the ability to effectively model the visual consensus, achieves state-of-the-art performance on the bidirectional video-text retrieval task. Our code is available at https://github.com/sqiangcao99/VCM.
Researcher Affiliation | Collaboration | 1 School of Control Science and Engineering, Shandong University; 2 Meituan
Pseudocode | No | The paper describes its method through text and diagrams but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/sqiangcao99/VCM.
Open Datasets | Yes | We perform experiments on two public benchmark datasets for the video-text retrieval task, including the MSR-VTT (Xu et al. 2016) and the ActivityNet (Krishna et al. 2017).
Dataset Splits | Yes | The MSR-VTT (Xu et al. 2016) dataset contains 10,000 videos and 200,000 descriptions, where each video is annotated with 20 sentences. Following the setting from (Liu et al. 2019; Miech et al. 2019; Gabeur et al. 2020; Luo et al. 2021), we use 9,000 videos for training and report results on the other 1,000 videos. The ActivityNet (Krishna et al. 2017) dataset consists of 20,000 YouTube videos with 100,000 densely annotated descriptions. Following the setting from (Zhang, Hu, and Sha 2018; Gabeur et al. 2020), we perform a video-paragraph retrieval task by concatenating all the descriptions of a video as a paragraph. Performances are reported on the val1 split of ActivityNet.
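
The split statistics quoted above can be condensed into a small reference sketch. The dictionary and key names below (DATASET_SPLITS, "msrvtt", "train_videos", etc.) are illustrative assumptions rather than identifiers from the VCM repository; only the numbers come from the quoted text.

```python
# Hedged sketch: reported split statistics gathered into a plain dictionary.
DATASET_SPLITS = {
    "msrvtt": {
        "total_videos": 10_000,
        "captions_per_video": 20,
        "train_videos": 9_000,  # split following Liu et al. 2019; Miech et al. 2019; Gabeur et al. 2020; Luo et al. 2021
        "test_videos": 1_000,
    },
    "activitynet": {
        "total_videos": 20_000,
        "total_descriptions": 100_000,
        "eval_split": "val1",   # video-paragraph retrieval: a video's descriptions concatenated into one paragraph
    },
}
```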
Hardware Specification | Yes | All of the experiments are conducted on 4 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions using the CLIP model and optimizers like Adam and AdaDelta, but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For MSR-VTT, the frame sequence length is set to 12 and the word sequence length to 32, while for ActivityNet both the frame and word sequence lengths are set to 64. The dimensions of the instance representations, visual concept representations, knowledge-constructed representations, and knowledge-enhanced representations are set to 512. In the CKL module, the number of visual concept representations k for building the cross-modal consensus knowledge graph is set to 300, and ϵ_M in Eq. 3 is set to 0.3 to filter out unreliable relationships among video concepts. In the KI module, θ in Eq. 5 is set to 10 and γ = 0.85 in Eq. 7. The loss weights λ1, λ2, λ3, and λ4 in Eq. 12 are set to 1.0, 0.25, 0.0125, and 0.4, respectively. The hyperparameters β1, β2, and β3 weighting the different types of similarity in Eq. 13 are set to 0.35, 0.25, and 0.40, respectively. During training, the batch size is set to 128, the learning rate to 1e-4, and the maximum number of training epochs to 10.
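
For quick reference, the setup above can be collected into a single configuration sketch. The dictionary and field names (VCM_CONFIG, "max_frames", "epsilon_M", and so on) are illustrative assumptions, not the repository's actual configuration schema; only the values are taken from the quoted setup.

```python
# Hedged sketch: reported hyperparameters collected into one configuration dict.
VCM_CONFIG = {
    # sequence lengths per dataset
    "msrvtt": {"max_frames": 12, "max_words": 32},
    "activitynet": {"max_frames": 64, "max_words": 64},
    # dimension of instance, visual concept, knowledge-constructed, and knowledge-enhanced representations
    "embed_dim": 512,
    # CKL module
    "num_visual_concepts_k": 300,   # size of the cross-modal consensus knowledge graph
    "epsilon_M": 0.3,               # Eq. 3 threshold for filtering unreliable video-concept relations
    # KI module
    "theta": 10,                    # Eq. 5
    "gamma": 0.85,                  # Eq. 7
    # loss weights lambda_1..lambda_4 (Eq. 12)
    "loss_weights": (1.0, 0.25, 0.0125, 0.4),
    # similarity fusion weights beta_1..beta_3 (Eq. 13)
    "similarity_weights": (0.35, 0.25, 0.40),
    # optimization
    "batch_size": 128,
    "learning_rate": 1e-4,
    "max_epochs": 10,
}
```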