Diffusion-Inspired Truncated Sampler for Text-Video Retrieval
Authors: Jiamian Wang, Pichao Wang, Dongfang Liu, Qiang Guan, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on five benchmark datasets suggest the state-of-the-art performance of DITS. We empirically find that DITS can also improve the structure of the CLIP embedding space. |
| Researcher Affiliation | Collaboration | Jiamian Wang¹, Pichao Wang², Dongfang Liu¹, Qiang Guan³, Sohail Dianat¹, Majid Rabbani¹, Raghuveer Rao⁴, Zhiqiang Tao¹; ¹Rochester Institute of Technology, ²Amazon, ³Kent State University, ⁴DEVCOM Army Research Laboratory |
| Pseudocode | No | The paper includes diagrams illustrating the process (e.g., Figure 2) with some pseudo-code-like elements, but it does not feature a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Code is available at https://github.com/JiamianWang/DITS-text-video-retrieval |
| Open Datasets | Yes | We employ five benchmark datasets for evaluation. Firstly, we utilize MSRVTT [Xu et al., 2016], comprising 10,000 YouTube video clips (each having 20 captions) and follow the 1K-A testing split in Liu et al. [2019]. Secondly, LSMDC [Rohrbach et al., 2015] includes 118,081 text-video pairs, providing videos with longer duration. The testing set contains 1000 videos, as per Gabeur et al. [2020], Gorti et al. [2022]. Thirdly, DiDeMo [Anne Hendricks et al., 2017] contains 40,000 captions and 10,000 video clips. We adhere to the data splits detailed in Luo et al. [2022], Jin et al. [2023]. Fourthly, VATEX [Wang et al., 2019] comprises 41,250 video clips, where each is paired with ten English and ten Chinese descriptions. We follow the split in Chen et al. [2020a]. Lastly, Charades [Sigurdsson et al., 2016] contains 9,848 video clips, each with multiple text descriptions detailing daily activities and actions. We adopt the split protocol of Lin et al. [2022]. |
| Dataset Splits | Yes | Firstly, we utilize MSRVTT [Xu et al., 2016], comprising 10,000 YouTube video clips (each having 20 captions) and follow the 1K-A testing split in Liu et al. [2019]. Secondly, LSMDC [Rohrbach et al., 2015] includes 118,081 text-video pairs, providing videos with longer duration. The testing set contains 1000 videos, as per Gabeur et al. [2020], Gorti et al. [2022]. Thirdly, DiDeMo [Anne Hendricks et al., 2017] contains 40,000 captions and 10,000 video clips. We adhere to the data splits detailed in Luo et al. [2022], Jin et al. [2023]. Fourthly, VATEX [Wang et al., 2019] comprises 41,250 video clips, where each is paired with ten English and ten Chinese descriptions. We follow the split in Chen et al. [2020a]. Lastly, Charades [Sigurdsson et al., 2016] contains 9,848 video clips, each with multiple text descriptions detailing daily activities and actions. We adopt the split protocol of Lin et al. [2022]. (A configuration-style summary of these splits appears below the table.) |
| Hardware Specification | Yes | We implement DITS with PyTorch [Paszke et al., 2019] and perform experiments on an NVIDIA A100 GPU. All training and inference costs are measured with the same computational platform (2× NVIDIA RTX 3090 GPUs, 24 GB each; Intel i9-10900X CPU). |
| Software Dependencies | Yes | We implement DITS with PyTorch [Paszke et al., 2019] and perform experiments on an NVIDIA A100 GPU. All of the parameters Θ = {θ, ϕ, γ} are trained with an AdamW [Loshchilov and Hutter, 2017] optimizer with weight decay of 0.2 and warmup rate of 0.1. (An optimizer-setup sketch appears below the table.) |
| Experiment Setup | Yes | The dropout is set to 0.3. Different from DiT [Peebles and Xie, 2023], we set our denoising network (also used as the alignment network in DITS) with N = 4 blocks, with 16 heads and an MLP ratio of 4.0. We let the dimension d = 512 for the whole model. We find that a timestamp of T = 10 is enough for diffusion-based alignment. For DITS, we set the truncated timestamp T = 5 for DiDeMo and T = 10 for others. A linear variance schedule ranging from β = 0.1 to β = 0.99 is adopted. All of the parameters Θ = {θ, ϕ, γ} are trained with an AdamW [Loshchilov and Hutter, 2017] optimizer with weight decay of 0.2 and warmup rate of 0.1. We set the training epochs to 5 for all datasets and adopt the same seed of 24. We perform contrastive learning with a batch size of B = 32 for all datasets and backbones. Same as X-Pool [Gorti et al., 2022], the learning rate of the CLIP model is initialized as 1 × 10⁻⁵. The learning rate for non-CLIP modules is 3 × 10⁻⁵ for MSRVTT [Xu et al., 2016] and 1 × 10⁻⁵ for all the other datasets. (Illustrative sketches of the variance schedule, truncated sampling, and optimizer setup appear below the table.) |
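
For quick reference, the evaluation datasets and split protocols quoted in the Open Datasets and Dataset Splits rows can be restated as a plain configuration dictionary. This is only an informal summary of the text above, not the data-loading code from the released repository; the field names are chosen here for readability.

```python
# Informal summary of the evaluation protocol quoted in the table above.
# Field names are illustrative; they do not mirror the released repository.
EVAL_SETUP = {
    "MSRVTT":   {"clips": 10_000, "captions_per_clip": 20,
                 "test_split": "1K-A (Liu et al., 2019)"},
    "LSMDC":    {"text_video_pairs": 118_081, "test_videos": 1_000,
                 "test_split": "Gabeur et al. (2020); Gorti et al. (2022)"},
    "DiDeMo":   {"clips": 10_000, "captions": 40_000,
                 "test_split": "Luo et al. (2022); Jin et al. (2023)"},
    "VATEX":    {"clips": 41_250, "captions_per_clip": "10 English + 10 Chinese",
                 "test_split": "Chen et al. (2020a)"},
    "Charades": {"clips": 9_848,
                 "test_split": "Lin et al. (2022)"},
}

if __name__ == "__main__":
    for name, cfg in EVAL_SETUP.items():
        print(f"{name}: {cfg}")
```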
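
The Software Dependencies and Experiment Setup rows specify AdamW with weight decay 0.2, a warmup rate of 0.1, a CLIP learning rate of 1 × 10⁻⁵, and a non-CLIP learning rate of 3 × 10⁻⁵ (the MSRVTT value). The sketch below sets this up with two PyTorch parameter groups. The grouping rule (matching "clip" in parameter names) and the constant learning rate after warmup are assumptions made here for illustration; the paper only states the warmup rate.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model, clip_lr=1e-5, other_lr=3e-5,
                    weight_decay=0.2, total_steps=1000, warmup_rate=0.1):
    """Two parameter groups: CLIP backbone vs. newly added modules.

    The name-matching rule below is an assumption for illustration; the
    released code may organize parameter groups differently.
    """
    clip_params, other_params = [], []
    for name, param in model.named_parameters():
        (clip_params if "clip" in name.lower() else other_params).append(param)

    optimizer = AdamW(
        [{"params": clip_params, "lr": clip_lr},
         {"params": other_params, "lr": other_lr}],
        weight_decay=weight_decay,
    )

    warmup_steps = max(1, int(warmup_rate * total_steps))

    def lr_lambda(step):
        # Linear warmup over the first 10% of steps, then constant.
        # The post-warmup shape is assumed; the paper only gives warmup rate 0.1.
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```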
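
The diffusion hyperparameters in the Experiment Setup row (T = 10 steps, a linear variance schedule from β = 0.1 to β = 0.99, and a truncated starting timestep, e.g. T = 5 for DiDeMo) can be illustrated with a standard DDPM-style forward process. This is a generic truncated-sampling sketch under common diffusion conventions, not the authors' DITS implementation; `truncated_start` and the choice of which tensor is diffused are hypothetical here.

```python
import torch


def linear_beta_schedule(num_steps=10, beta_start=0.1, beta_end=0.99):
    """Linear variance schedule over the T = 10 diffusion steps quoted above."""
    return torch.linspace(beta_start, beta_end, num_steps)


def truncated_start(x0, t_trunc, alphas_cumprod):
    """Noise a clean feature up to an intermediate timestep t_trunc.

    Sampling then starts from this partially noised state rather than from
    pure noise, which is the intuition behind truncation; the exact quantity
    diffused in DITS follows the paper, this is only a generic sketch.
    """
    a_bar = alphas_cumprod[t_trunc - 1]
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise


betas = linear_beta_schedule(num_steps=10)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(32, 512)  # batch of d = 512 features with batch size B = 32
x_t = truncated_start(x0, t_trunc=5, alphas_cumprod=alphas_cumprod)  # DiDeMo-style truncation
```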