Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning
Authors: Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen, Mang Ye
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate that RSFD outperforms state-of-the-art methods and that enhancing the semantics of low-frequency tokens yields a competitive generation effect. Code is available at https://github.com/lzp870/RSFD. |
| Researcher Affiliation | Academia | Xian Zhong1, Zipeng Li1, Shuqin Chen2,*, Kui Jiang3, Chen Chen4, Mang Ye3; 1 School of Computer Science and Artificial Intelligence, Wuhan University of Technology; 2 College of Computer, Hubei University of Education; 3 School of Computer Science, Wuhan University; 4 Center for Research in Computer Vision, University of Central Florida. Emails: {zhongx, lizipeng, csqcwx0801}@whut.edu.cn, kuijiang@whu.edu.cn, chen.chen@crcv.ucf.edu, mangye16@gmail.com |
| Pseudocode | No | The paper describes its algorithms and processes in textual form and through diagrams (e.g., Figure 2 for the RSFD architecture) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/lzp870/RSFD. |
| Open Datasets | Yes | MSR-VTT (Xu et al. 2016) consists of 10,000 video clips, each annotated with 20 English captions and a category. MSVD (Chen and Dolan 2011) is a widely used benchmark video captioning dataset collected from YouTube, composed of 1,970 video clips and roughly 80,000 English sentences. |
| Dataset Splits | Yes | Following the standard split, we use the same setup as previous works (Pan et al. 2020; Ye et al. 2022a), which takes 6,513 video clips for training, 497 video clips for validation, and 2,990 video clips for testing on MSR-VTT, as well as 1,200, 100, and 670 videos for training, validation, and testing on MSVD. (A split sketch follows the table.) |
| Hardware Specification | Yes | All our experiments are conducted on two NVIDIA Tesla PH402 SKU 200 GPUs. |
| Software Dependencies | No | The paper does not specify software versions. It mentions an ImageNet pre-trained ResNet-101, a Kinetics pre-trained ResNeXt-101, and Adam for optimization, but no version numbers for libraries or frameworks such as TensorFlow, PyTorch, or specific Python versions are provided. |
| Experiment Setup | Yes | In our implementation, the size of the video feature d_v and the hidden size d_h are set to 2,048 and 512, respectively. Empirically, we set the number of sampled frames K = 8 for each video clip. We set the maximum sequence length T to 30 on MSR-VTT, whereas T = 20 on MSVD. The Transformer decoder has one decoder layer, 8 attention heads, a 0.5 dropout ratio, and 0.0005 ℓ2 weight decay. Word embeddings are implemented as trainable 512-dimensional embedding layers. In the training phase, we adopt Adam (Kingma and Ba 2015) with an initial learning rate of 0.005 to optimize our model. The batch size is set to 64, and the number of training epochs to 50. During testing, we use beam search with a beam size of 5 to generate the predicted sentences. γ and δ, which decide the category of a token, are set to 0.015 and 0.0015, respectively. We set λ = 0.07 on MSR-VTT and λ = 0.4 on MSVD to reflect the significance of the divergent loss. (A hedged configuration sketch follows the table.) |