Multi-modal Circulant Fusion for Video-to-Language and Backward

Authors: Aming Wu, Yahong Han

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MCF with tasks of video captioning and temporal activity localization via language (TALL). Experiments on MSVD and MSRVTT show our method obtains the state-of-the-art performance for video captioning. For TALL, by plugging into MCF, we achieve a performance gain of roughly 4.2% on TACoS.
Researcher Affiliation | Academia | Aming Wu and Yahong Han, School of Computer Science and Technology, Tianjin University, Tianjin, China. {tjwam, yahong}@tju.edu.cn
Pseudocode | No | The paper includes a flowchart (Figure 2) illustrating the detailed procedure of Multi-modal Circulant Fusion (MCF), but it does not present pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any links to source code, nor does it state that the code will be made publicly available.
Open Datasets | Yes | MSVD [Chen and Dolan, 2011] contains 1,970 video clips. MSRVTT [Xu et al., 2016] contains 10,000 video clips. TACoS dataset [Regneri et al., 2013].
Dataset Splits | Yes | For the MSVD dataset, we use 1,200 clips for training, 100 clips for validation, and 670 clips for testing. For the MSRVTT dataset, we use 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing. For TACoS, we split it into 50% for training, 25% for validation, and 25% for testing.
Hardware Specification | No | The paper mentions using pre-trained convolutional networks such as GoogLeNet and ResNet152 for feature extraction, but it does not specify any hardware details such as GPU models, CPU types, or memory used for training or inference.
Software Dependencies | No | The paper mentions the Adam optimizer and refers to the GoogLeNet and ResNet152 models, but it does not specify any software dependencies with version numbers, such as programming languages or deep learning frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | For the multi-stage decoder, we use five dilated layers with dilation rates 1, 1, 2, 4, and 2. The number of filter channels is set to 512, 256, 256, 512, and 512, respectively. The width of the filter is set to 2. For MCF, we set W1 ∈ R^{256×512}, W2 ∈ R^{256×512} (in Eq. (1)), and W3 ∈ R^{256×512}. ... We use the Adam optimizer with an initial learning rate of 1×10^{-3}. We empirically set β1 and β2 to 0.9 and 0.1, respectively, and λ0, λ1, and λ2 are set to 0.2, 0.2, and 0.6, respectively.
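
Since the paper releases neither code nor pseudocode, a minimal sketch of the quoted setup may help a reproduction attempt. The framework (PyTorch), the 512-d input features, and the exact form of the circulant interaction are assumptions of this sketch; only the projection shapes (W1, W2 ∈ R^{256×512}) and the Adam learning rate of 1×10^{-3} come from the quoted setup.

```python
import torch
import torch.nn as nn

class CirculantFusion(nn.Module):
    """Generic sketch of circulant fusion between two modalities.

    Projection shapes follow the paper (W1, W2 in R^{256x512}); the
    interaction below (averaging matrix-vector products over all circular
    shifts) is one plausible reading, not the authors' implementation.
    The paper's W3 (same shape) serves a later step and is omitted here.
    """
    def __init__(self, in_dim=512, out_dim=256):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)  # visual projection
        self.W2 = nn.Linear(in_dim, out_dim, bias=False)  # textual projection

    @staticmethod
    def circulant(x):
        """Stack all circular shifts of the last dim into a (d, d) matrix."""
        d = x.shape[-1]
        return torch.stack([torch.roll(x, i, dims=-1) for i in range(d)], dim=-2)

    def forward(self, v, t):
        a, b = self.W1(v), self.W2(t)  # project both modalities to 256-d
        d = a.shape[-1]
        # every circular shift of one modality interacts with the other
        f = (self.circulant(a) @ b.unsqueeze(-1)).squeeze(-1) / d
        g = (self.circulant(b) @ a.unsqueeze(-1)).squeeze(-1) / d
        return f, g

mcf = CirculantFusion()
v = torch.randn(4, 512)  # e.g. pooled CNN clip features (hypothetical)
t = torch.randn(4, 512)  # e.g. sentence embedding (hypothetical)
f, g = mcf(v, t)
optimizer = torch.optim.Adam(mcf.parameters(), lr=1e-3)  # lr from the paper
```

The multi-stage decoder row likewise translates to a stack of dilated 1-D convolutions. The dilation rates, channel counts, and filter width of 2 follow the quoted setup; the causal left-padding, the ReLU activations, and the 512-d input are assumptions.

```python
def dilated_decoder(in_dim=512):
    """Five dilated conv layers per the quoted setup (a sketch)."""
    rates = [1, 1, 2, 4, 2]               # dilation rates from the paper
    channels = [512, 256, 256, 512, 512]  # filter channels from the paper
    layers, c_prev = [], in_dim
    for r, c in zip(rates, channels):
        layers += [
            nn.ConstantPad1d((r, 0), 0.0),  # causal left-pad preserves length
            nn.Conv1d(c_prev, c, kernel_size=2, dilation=r),  # filter width 2
            nn.ReLU(),  # activation choice is an assumption
        ]
        c_prev = c
    return nn.Sequential(*layers)

decoder = dilated_decoder()
y = decoder(torch.randn(4, 512, 20))  # (batch, channels, time) -> (4, 512, 20)
```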