Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Authors: Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness.
Researcher Affiliation | Industry | Kuaishou Technology {tiankaibin, chengyanhua, liuyi24, houxinglin, chenquan06, lihan08}@kuaishou.com
Pseudocode | Yes | Algorithm 1: Recall and Re-ranking during Retrieval (a Python-style sketch of this two-stage procedure appears after the table)
Input: video gallery set V = {v_i}_{i=1}^{K}, a query text t
Output: the most matching video
1: Encode t as a text feature θ(t)
2: # Recall Stage
3: for i = 1, ..., K do
4:   Encode v_i as a video-level video feature v_{L1,i}
5:   Compute the similarity score between θ(t) and v_{L1,i}
6: end for
7: Select the top-k highest-scoring videos as the candidate set.
8: # Re-ranking Stage
9: for i = 1, ..., k do
10:   Encode v_i as a frame-level video feature v_{L2,i}
11:   Encode v_i as a patch-level video feature v_{L3,i}
12:   Compute the similarity score between θ(t) and the weighted sum of (v_{L1,i}, v_{L2,i}, v_{L3,i})
13: end for
14: Select the highest-scoring video as the most matching video
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | We perform experiments on the commonly used benchmarks of MSR-VTT (Xu et al. 2016), VATEX (Wang et al. 2019), MSVD (Chen and Dolan 2011), and ActivityNet (Heilbron et al. 2015).
Dataset Splits | No | The paper mentions using standard benchmark datasets like MSR-VTT and VATEX, but does not explicitly provide the specific training, validation, and test split percentages or counts within the text for reproducibility.
Hardware Specification | Yes | We perform the experiments on 24 NVIDIA Tesla T4 15GB GPUs using the PyTorch library.
Software Dependencies | No | The paper mentions using the 'PyTorch library' but does not specify its version number or other specific software dependencies with their versions.
Experiment Setup | Yes | We train our model via the Adam optimizer and decay the learning rate using a cosine schedule strategy. For better finetuning, we set different learning rates for different modules, where the spatial encoder and text encoder are set to 1e-7, owing to CLIP initialization, and other new modules, like the temporal encoder, are set to 1e-4. The max word token length and max frame length are fixed to 32 and 12 for MSR-VTT, MSVD, and VATEX, while the corresponding settings are 64 and 64 for ActivityNet due to the longer captions of video-paragraph retrieval. Limited by GPU memory, we set the batch size of MSR-VTT, MSVD, VATEX, and ActivityNet to 240, 240, 360, and 96, respectively. We train 5 epochs for all datasets. Unless otherwise specified, the hyperparameters mentioned in our equations are empirically set as follows: π = 0.1 and π = 0.01 separately for the frame-level and patch-level TIB modules, {α = 0.05} in the L_intra loss, {β = 0.001, λ_{v_L1} : λ_{v_L2} : λ_{v_L3} = 5 : 5 : 1} in the total loss, and top-k = 50 for our coarse-to-fine retrieval. (A hedged sketch of this optimizer configuration appears after the table.)
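
To make the Pseudocode row's Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of the recall-then-re-rank procedure. The encoder callables (encode_frame_level, encode_patch_level), the precomputed video-level feature matrix, and the weighted-sum fusion of per-granularity similarities are assumptions made for illustration; this is not the authors' released implementation.

```python
import torch.nn.functional as F

def coarse_to_fine_retrieval(text_feat, video_l1_feats, videos,
                             encode_frame_level, encode_patch_level,
                             top_k=50, weights=(5.0, 5.0, 1.0)):
    """Recall over the whole gallery with video-level features, then re-rank
    the top-k candidates with frame- and patch-level features (Algorithm 1 sketch)."""
    text_feat = F.normalize(text_feat, dim=-1)            # theta(t), shape (d,)
    video_l1_feats = F.normalize(video_l1_feats, dim=-1)  # shape (K, d)

    # Recall stage: rank the whole gallery by coarse video-level similarity.
    coarse_scores = video_l1_feats @ text_feat            # shape (K,)
    candidates = coarse_scores.topk(min(top_k, len(videos))).indices

    # Re-ranking stage: fuse coarse and fine-grained similarities for candidates.
    w1, w2, w3 = weights
    best_idx, best_score = None, float("-inf")
    for i in candidates.tolist():
        v_l2 = F.normalize(encode_frame_level(videos[i]), dim=-1)  # frame-level
        v_l3 = F.normalize(encode_patch_level(videos[i]), dim=-1)  # patch-level
        score = float(w1 * coarse_scores[i]
                      + w2 * (v_l2 @ text_feat)
                      + w3 * (v_l3 @ text_feat))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx  # index of the most matching video in the gallery
```

In this reading, the video-level features for the whole gallery are precomputed offline, which is what keeps the recall stage cheap; only the top-k candidates pay the cost of frame- and patch-level encoding during re-ranking.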
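
The Experiment Setup row describes per-module learning rates (1e-7 for the CLIP-initialized spatial and text encoders, 1e-4 for new modules such as the temporal encoder) with Adam and cosine decay. The snippet below is a hedged sketch of that optimizer configuration, assuming hypothetical module attribute names (spatial_encoder, text_encoder), since the paper does not release code.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model: torch.nn.Module, total_steps: int,
                    clip_lr: float = 1e-7, new_lr: float = 1e-4):
    """Adam with one learning rate for CLIP-initialized encoders and another
    for newly added modules, decayed with a cosine schedule."""
    clip_params, new_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed attribute names for the CLIP-initialized parts of the model.
        if name.startswith(("spatial_encoder", "text_encoder")):
            clip_params.append(param)
        else:
            new_params.append(param)  # e.g. temporal encoder, TIB modules

    optimizer = Adam([
        {"params": clip_params, "lr": clip_lr},  # 1e-7 for CLIP-initialized modules
        {"params": new_params, "lr": new_lr},    # 1e-4 for newly added modules
    ])
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```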