Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
Authors: Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness. |
| Researcher Affiliation | Industry | Kuaishou Technology {tiankaibin, chengyanhua, liuyi24, houxinglin, chenquan06, lihan08}@kuaishou.com |
| Pseudocode | Yes | Algorithm 1: Recall and Re-ranking during Retrieval. Input: video gallery set V = {v_i}_{i=1}^K, a query text t. Output: the most matching video. 1: Encode t as a text feature θ(t). 2: # Recall Stage. 3: for i ← 1, K do. 4: Encode v_i as a video-level video feature v_{L1,i}. 5: Compute the similarity score between θ(t) and v_{L1,i}. 6: end for. 7: Select the top-k highest-scoring videos as the candidate set. 8: # Re-ranking Stage. 9: for i ← 1, k do. 10: Encode v_i as a frame-level video feature v_{L2,i}. 11: Encode v_i as a patch-level video feature v_{L3,i}. 12: Compute the similarity score between θ(t) and the weighted sum of (v_{L1,i}, v_{L2,i}, v_{L3,i}). 13: end for. 14: Select the highest-scoring video as the most matching video. (See the retrieval sketch below the table.) |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We perform experiments on the commonly used benchmark of MSR-VTT (Xu et al. 2016), VATEX (Wang et al. 2019), MSVD (Chen and Dolan 2011), and ActivityNet (Heilbron et al. 2015). |
| Dataset Splits | No | The paper uses standard benchmarks such as MSR-VTT and VATEX, but does not state the training/validation/test split counts or percentages needed for reproducibility. |
| Hardware Specification | Yes | We perform the experiments on 24 NVIDIA Tesla T4 15GB GPUs using the PyTorch library. |
| Software Dependencies | No | The paper mentions using the 'PyTorch library' but does not specify its version number or other specific software dependencies with their versions. |
| Experiment Setup | Yes | We train our model via the Adam optimizer and decay the learning rate using a cosine schedule strategy. For better finetuning, we set different learning rates for different modules: the spatial encoder and text encoder are set to 1e-7, owing to CLIP initialization, and other new modules, like the temporal encoder, are set to 1e-4. The max word token length and max frame length are fixed to 32 and 12 for MSR-VTT, MSVD, and VATEX, while the corresponding settings are 64 and 64 for ActivityNet due to the longer captions of video-paragraph retrieval. Limited by GPU memory, we set the batch size of MSR-VTT, MSVD, VATEX, and ActivityNet to 240, 240, 360, and 96, respectively. We train 5 epochs for all datasets. Unless otherwise specified, the hyperparameters mentioned in our equations are empirically set as follows: π = 0.1 and π = 0.01 for the frame-level and patch-level TIB modules, respectively; {α = 0.05} in the L_intra loss; {β = 0.001, λ_{vL1} : λ_{vL2} : λ_{vL3} = 5 : 5 : 1} in the total loss; and top-k = 50 for our coarse-to-fine retrieval. (See the training-setup sketch below the table.) |
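To make the recall-then-re-rank flow of Algorithm 1 concrete, here is a minimal PyTorch sketch. It assumes each gallery video's video-, frame-, and patch-level features have already been aggregated into single L2-normalized vectors (the paper's TIB-based aggregation is omitted for brevity), and the 5 : 5 : 1 re-ranking weights are borrowed from the λ ratio in the loss rather than a value the paper states for inference.

```python
import torch

def coarse_to_fine_retrieval(text_feat, video_l1, video_l2, video_l3,
                             top_k=50, weights=(5.0, 5.0, 1.0)):
    # text_feat: (d,) query text feature theta(t); video_l*: (K, d) per-level
    # gallery features, assumed L2-normalized so a dot product is cosine similarity.

    # Recall stage: score the whole gallery with cheap video-level features
    # and keep the top-k candidates.
    coarse_scores = video_l1 @ text_feat                      # (K,)
    k = min(top_k, video_l1.size(0))
    candidates = coarse_scores.topk(k).indices                # (k,)

    # Re-ranking stage: re-score only the candidates with a weighted sum
    # of video-, frame-, and patch-level similarities.
    w1, w2, w3 = weights
    fine_scores = (w1 * (video_l1[candidates] @ text_feat)
                   + w2 * (video_l2[candidates] @ text_feat)
                   + w3 * (video_l3[candidates] @ text_feat))

    # Return the gallery index of the most matching video.
    return candidates[fine_scores.argmax()].item()
```

The design point is that the full gallery of K videos is scored only with the cheap video-level features, while the expensive frame- and patch-level scoring touches just the top-k candidates.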
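The per-module learning rates and cosine decay in the experiment setup map directly onto PyTorch parameter groups; a minimal sketch follows. The module names, feature sizes, and step count are illustrative stand-ins, not the authors' code.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in modules: the paper names the roles (CLIP-initialized spatial and
# text encoders, a newly added temporal encoder) but not these identifiers.
model = nn.ModuleDict({
    "spatial_encoder":  nn.Linear(512, 512),
    "text_encoder":     nn.Linear(512, 512),
    "temporal_encoder": nn.Linear(512, 512),
})

# Adam with per-module learning rates: 1e-7 for the CLIP-initialized
# encoders, 1e-4 for new modules such as the temporal encoder.
optimizer = Adam([
    {"params": model["spatial_encoder"].parameters(),  "lr": 1e-7},
    {"params": model["text_encoder"].parameters(),     "lr": 1e-7},
    {"params": model["temporal_encoder"].parameters(), "lr": 1e-4},
])

# Cosine learning-rate decay over the whole run (5 epochs on every dataset);
# the steps-per-epoch value here is an assumed placeholder.
total_steps = 5 * 1000
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
```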