Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning
Authors: Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen, Mang Ye
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate that RSFD outperforms state-of-the-art methods and that enhancing the semantics of low-frequency tokens yields a competitive generation effect. Code is available at https://github.com/lzp870/RSFD. |
| Researcher Affiliation | Academia | Xian Zhong1, Zipeng Li1, Shuqin Chen2,*, Kui Jiang3, Chen Chen4, Mang Ye3; 1 School of Computer Science and Artificial Intelligence, Wuhan University of Technology; 2 College of Computer, Hubei University of Education; 3 School of Computer Science, Wuhan University; 4 Center for Research in Computer Vision, University of Central Florida. Emails: {zhongx, lizipeng, csqcwx0801}@whut.edu.cn, kuijiang@whu.edu.cn, chen.chen@crcv.ucf.edu, mangye16@gmail.com |
| Pseudocode | No | The paper describes its algorithms and processes in textual form and through diagrams (e.g., Figure 2 for the RSFD architecture) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/lzp870/RSFD. |
| Open Datasets | Yes | MSR-VTT (Xu et al. 2016) consists of 10,000 video clips, each annotated with 20 English captions and a category. MSVD (Chen and Dolan 2011) is a widely used benchmark video captioning dataset collected from YouTube, composed of 1,970 video clips and roughly 80,000 English sentences. |
| Dataset Splits | Yes | Following the standard split, we use the same setup as previous works (Pan et al. 2020; Ye et al. 2022a), which takes 6,513 video clips for training, 497 video clips for validation, and 2,990 video clips for testing on MSR-VTT, as well as 1,200, 100, and 670 videos for training, validation, and testing on MSVD. (A split sketch follows the table.) |
| Hardware Specification | Yes | All our experiments are conducted on two NVIDIA Tesla PH402 SKU 200 GPUs. |
| Software Dependencies | No | The paper does not specify software versions. It mentions an ImageNet pre-trained ResNet-101, a Kinetics pre-trained ResNeXt-101, and Adam for optimization, but no version numbers for libraries or frameworks such as TensorFlow, PyTorch, or specific Python versions are provided. |
| Experiment Setup | Yes | In our implementation, the size of the video feature d_v and the hidden size d_h are set to 2,048 and 512, respectively. Empirically, we set the number of sampled frames K = 8 for each video clip. We set the maximum sequence length T to 30 on MSR-VTT, whereas T = 20 on MSVD. The Transformer decoder has one decoder layer, 8 attention heads, a 0.5 dropout ratio, and 0.0005 ℓ2 weight decay. Word embeddings are implemented as trainable 512-dimensional embedding layers. In the training phase, we adopt Adam (Kingma and Ba 2015) with an initial learning rate of 0.005 to optimize our model. The batch size is set to 64, and the number of training epochs to 50. During testing, we use beam search with a beam size of 5 to generate the predicted sentences. γ and δ, which decide the category of a token, are set to 0.015 and 0.0015, respectively. We set λ = 0.07 on MSR-VTT and λ = 0.4 on MSVD to reflect the significance of the divergent loss. (A hedged configuration sketch follows the table.) |