Diffusion-Inspired Truncated Sampler for Text-Video Retrieval

Authors: Jiamian Wang, Pichao Wang, Dongfang Liu, Qiang Guan, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on five benchmark datasets suggest the state-of-the-art performance of DITS. We empirically find that DITS can also improve the structure of the CLIP embedding space.
Researcher Affiliation | Collaboration | Jiamian Wang1, Pichao Wang2, Dongfang Liu1, Qiang Guan3, Sohail Dianat1, Majid Rabbani1, Raghuveer Rao4, Zhiqiang Tao1 (1Rochester Institute of Technology, 2Amazon, 3Kent State University, 4DEVCOM Army Research Laboratory)
Pseudocode | No | The paper includes diagrams illustrating the process (e.g., Figure 2) with some pseudocode-like elements, but it does not feature a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code is available at https://github.com/JiamianWang/DITS-text-video-retrieval
Open Datasets | Yes | We employ five benchmark datasets for evaluation. Firstly, we utilize MSRVTT [Xu et al., 2016], comprising 10,000 YouTube video clips (each having 20 captions), and follow the 1K-A testing split in Liu et al. [2019]. Secondly, LSMDC [Rohrbach et al., 2015] includes 118,081 text-video pairs, providing videos with longer duration; the testing set contains 1000 videos, as per Gabeur et al. [2020], Gorti et al. [2022]. Thirdly, DiDeMo [Anne Hendricks et al., 2017] contains 40,000 captions and 10,000 video clips. We adhere to the data splits detailed in Luo et al. [2022], Jin et al. [2023]. Fourthly, VATEX [Wang et al., 2019] comprises 41,250 video clips, where each is paired with ten English and ten Chinese descriptions. We follow the split in Chen et al. [2020a]. Lastly, Charades [Sigurdsson et al., 2016] contains 9,848 video clips, each with multiple text descriptions detailing daily activities and actions. We adopt the split protocol of Lin et al. [2022].
Dataset Splits | Yes | Firstly, we utilize MSRVTT [Xu et al., 2016], comprising 10,000 YouTube video clips (each having 20 captions), and follow the 1K-A testing split in Liu et al. [2019]. Secondly, LSMDC [Rohrbach et al., 2015] includes 118,081 text-video pairs, providing videos with longer duration; the testing set contains 1000 videos, as per Gabeur et al. [2020], Gorti et al. [2022]. Thirdly, DiDeMo [Anne Hendricks et al., 2017] contains 40,000 captions and 10,000 video clips. We adhere to the data splits detailed in Luo et al. [2022], Jin et al. [2023]. Fourthly, VATEX [Wang et al., 2019] comprises 41,250 video clips, where each is paired with ten English and ten Chinese descriptions. We follow the split in Chen et al. [2020a]. Lastly, Charades [Sigurdsson et al., 2016] contains 9,848 video clips, each with multiple text descriptions detailing daily activities and actions. We adopt the split protocol of Lin et al. [2022].
Hardware Specification | Yes | We implement DITS with PyTorch [Paszke et al., 2019] and perform experiments on an NVIDIA A100 GPU. All training and inference costs are measured with the same computational platform (2 NVIDIA RTX 3090 GPUs, 24GB; Intel i9-10900X CPU).
Software Dependencies | Yes | We implement DITS with PyTorch [Paszke et al., 2019] and perform experiments on an NVIDIA A100 GPU. All of the parameters Θ = {θ, ϕ, γ} are trained with an AdamW [Loshchilov and Hutter, 2017] optimizer with a weight decay of 0.2 and a warmup rate of 0.1.
Experiment Setup | Yes | The dropout is set to 0.3. Different from DiT [Peebles and Xie, 2023], we set our denoising network (also used as the alignment network in DITS) with N = 4 blocks, 16 heads, and an MLP ratio of 4.0. We let the dimension d = 512 for the whole model. We find that a timestep of T = 10 is enough for diffusion-based alignment. For DITS, we set the truncated timestep to T = 5 for DiDeMo and T = 10 for the others. A linear variance schedule with β_1 = 0.1 and β_T = 0.99 is adopted. All of the parameters Θ = {θ, ϕ, γ} are trained with an AdamW [Loshchilov and Hutter, 2017] optimizer with a weight decay of 0.2 and a warmup rate of 0.1. We set the training epochs to 5 for all datasets and adopt the same seed of 24. We perform contrastive learning with a batch size of B = 32 for all datasets and backbones. Same as X-Pool [Gorti et al., 2022], the learning rate of the CLIP model is initialized as 1 × 10−5. The learning rate for non-CLIP modules is 3 × 10−5 for MSRVTT [Xu et al., 2016] and 1 × 10−5 for all the other datasets.
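
For concreteness, here is a minimal PyTorch sketch of a network with the quoted shape hyperparameters (d = 512, N = 4 blocks, 16 heads, MLP ratio 4.0, dropout 0.3). It uses a plain nn.TransformerEncoder as a stand-in and omits the DiT-style timestep conditioning, so it illustrates the sizes only, not the authors' exact block.

```python
import torch.nn as nn

D_MODEL, N_BLOCKS, N_HEADS, MLP_RATIO, DROPOUT = 512, 4, 16, 4.0, 0.3

# Generic Transformer encoder matching the quoted shape hyperparameters.
# A plain encoder stand-in; the paper's DiT-style block additionally
# conditions on the diffusion timestep, which is omitted here.
denoising_net = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=D_MODEL,
        nhead=N_HEADS,
        dim_feedforward=int(MLP_RATIO * D_MODEL),  # MLP ratio of 4.0
        dropout=DROPOUT,
        batch_first=True,
    ),
    num_layers=N_BLOCKS,
)
```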
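The variance schedule and truncated sampling can likewise be sketched. The snippet below builds a linear β schedule over T = 10 steps (reconstructed as β_1 = 0.1 to β_T = 0.99) and runs a standard DDPM-style reverse loop started at a truncated timestep; `denoise_fn` and `x_cond` are hypothetical placeholders, and the initialization of the truncated chain is an assumption rather than the paper's exact sampler.

```python
import torch

T = 10        # diffusion timesteps for alignment
T_TRUNC = 5   # truncated starting timestep (DiDeMo setting)

# Linear variance schedule beta_1 = 0.1 ... beta_T = 0.99.
betas = torch.linspace(0.1, 0.99, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product \bar{alpha}_t

def truncated_sample(x_cond, denoise_fn, t_start=T_TRUNC):
    """DDPM-style reverse process started at t_start instead of T.

    `x_cond` stands in for the conditioning embedding and
    `denoise_fn(x_t, t, x_cond)` for the noise-prediction network;
    both are placeholders for illustration.
    """
    # Initialization is an assumption: a truncated sampler may instead
    # start from a coarse estimate rather than pure Gaussian noise.
    x_t = torch.randn_like(x_cond)
    for t in reversed(range(t_start)):
        eps = denoise_fn(x_t, t, x_cond)
        a_t, ab_t = alphas[t], alpha_bars[t]
        # Posterior mean of x_{t-1} given predicted noise (standard DDPM form).
        mean = (x_t - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise
    return x_t
```

Starting the reverse chain at a small t rather than T is what makes the sampler "truncated": fewer denoising steps are run at inference, which is the efficiency motivation the setup implies.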
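Finally, a minimal sketch of the quoted optimization setup, assuming standard torch.optim APIs. The `clip_params`/`other_params` parameter groups and the linear-warmup LambdaLR are assumptions about details the row does not specify.

```python
import torch

# Hyperparameters quoted from the paper's setup (MSRVTT case).
EPOCHS = 5
BATCH_SIZE = 32     # contrastive batch size B
LR_CLIP = 1e-5      # learning rate for the CLIP backbone
LR_NON_CLIP = 3e-5  # non-CLIP modules on MSRVTT (1e-5 on other datasets)
WEIGHT_DECAY = 0.2
WARMUP_RATE = 0.1   # fraction of total steps used for warmup

torch.manual_seed(24)  # the paper fixes the same seed for all datasets

def build_optimizer(clip_params, other_params, steps_per_epoch):
    # Two parameter groups with separate learning rates; the actual
    # CLIP / non-CLIP split depends on the model definition.
    optimizer = torch.optim.AdamW(
        [
            {"params": clip_params, "lr": LR_CLIP},
            {"params": other_params, "lr": LR_NON_CLIP},
        ],
        weight_decay=WEIGHT_DECAY,
    )
    total_steps = EPOCHS * steps_per_epoch
    warmup_steps = int(WARMUP_RATE * total_steps)
    # Linear warmup then constant; the row does not spell out a decay shape.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, step / max(1, warmup_steps)),
    )
    return optimizer, scheduler
```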