Integrating Both Visual and Audio Cues for Enhanced Video Caption

Authors: Wangli Hao, Zhaoxiang Zhang, He Guan

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have validated the effectiveness of our three cross-modalities fusion strategies on two benchmark datasets, including Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD).
Researcher Affiliation | Academia | Wangli Hao (1,4), Zhaoxiang Zhang (1,2,3,4), He Guan (1,4); 1 Research Center for Brain-inspired Intelligence, CASIA; 2 National Laboratory of Pattern Recognition, CASIA; 3 CAS Center for Excellence in Brain Science and Intelligence Technology, CAS; 4 University of Chinese Academy of Sciences
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit statement of code release, or mention of code in supplementary materials) for the source code of the methodology described.
Open Datasets | Yes | To validate the performance of our model, we utilize the Microsoft Research-Video to Text Dataset (MSR-VTT) and Microsoft Video Description Dataset (MSVD) (Chen and Dolan 2011). Their split method can be found in (Xu et al. 2016) and (Yao et al. 2015) respectively.
Dataset Splits | Yes | Their split method can be found in (Xu et al. 2016) and (Yao et al. 2015) respectively. ... Parameter P is tuned on the validation set. (A sketch of these standard splits follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | Parameters are set as follows: beam search size, word embedding dimension and LSTM hidden state dimension are 5, 468 and 512 respectively. Sizes of the visual-auditory and visual-textual shared memories are 64×128 and 128×512 respectively. To avoid overfitting, dropout (Srivastava et al. 2014) with rate 0.5 is utilized on both the output of the fully connected layer and the output layers of the LSTM, but not on the intermediate recurrent transitions. In addition, gradients are clipped into the range [-10, 10] to prevent gradient explosion. The optimization algorithm used for our deep feature fusion frameworks is ADADELTA (Zeiler 2012). (A training-configuration sketch of these settings follows the table.)
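
For convenience, the splits referenced in the Dataset Splits row are sketched below. The split sizes are the ones commonly used for MSR-VTT (Xu et al. 2016) and MSVD (Yao et al. 2015); they are stated here as background knowledge rather than quoted from the paper, and the helper function and variable names are hypothetical.

# Sketch of the standard benchmark splits referenced above (assumed, not quoted
# from the paper): MSR-VTT uses 6,513/497/2,990 clips and MSVD uses
# 1,200/100/670 clips for train/validation/test.
MSRVTT_SPLIT = {"train": 6513, "val": 497, "test": 2990}   # 10,000 clips total
MSVD_SPLIT = {"train": 1200, "val": 100, "test": 670}      # 1,970 clips total

def split_by_index(video_ids, sizes):
    """Partition an ordered list of video ids into contiguous train/val/test chunks."""
    train_end = sizes["train"]
    val_end = train_end + sizes["val"]
    return {
        "train": video_ids[:train_end],
        "val": video_ids[train_end:val_end],
        "test": video_ids[val_end:],
    }

# Example usage with a hypothetical ordered id list:
# msvd_splits = split_by_index(sorted_msvd_ids, MSVD_SPLIT)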
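
The Experiment Setup row can also be made concrete with a minimal PyTorch-style sketch. Only the hyperparameter values (beam width 5, word embedding 468, LSTM hidden size 512, 64×128 and 128×512 shared memories, dropout 0.5, gradient clipping to [-10, 10], ADADELTA) come from the paper; the module and function names, the feature dimension, and the vocabulary size are hypothetical placeholders, and this is not the authors' implementation of their fusion frameworks.

import torch
import torch.nn as nn

# Hyperparameter values reported in the paper.
BEAM_SIZE = 5                  # beam search width at decoding time
WORD_EMBED_DIM = 468           # word embedding dimension
HIDDEN_DIM = 512               # LSTM hidden state dimension
VIS_AUD_MEMORY = (64, 128)     # visual-auditory shared memory size
VIS_TXT_MEMORY = (128, 512)    # visual-textual shared memory size
DROPOUT_RATE = 0.5
GRAD_CLIP = 10.0               # gradient values clipped into [-10, 10]

class CaptionDecoder(nn.Module):
    """Hypothetical decoder skeleton that only mirrors the reported dimensions."""
    def __init__(self, vocab_size=10000, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, WORD_EMBED_DIM)
        self.fc = nn.Linear(feat_dim, HIDDEN_DIM)
        self.fc_dropout = nn.Dropout(DROPOUT_RATE)    # dropout on the FC output
        self.lstm = nn.LSTM(WORD_EMBED_DIM + HIDDEN_DIM, HIDDEN_DIM, batch_first=True)
        self.out_dropout = nn.Dropout(DROPOUT_RATE)   # dropout on LSTM outputs only,
                                                      # not on recurrent transitions
        self.classifier = nn.Linear(HIDDEN_DIM, vocab_size)

    def forward(self, feats, captions):
        ctx = self.fc_dropout(self.fc(feats))                     # (B, H) fused video feature
        ctx = ctx.unsqueeze(1).expand(-1, captions.size(1), -1)   # repeat over time steps
        words = self.embed(captions)                              # (B, T, E)
        hidden, _ = self.lstm(torch.cat([words, ctx], dim=-1))
        return self.classifier(self.out_dropout(hidden))          # (B, T, vocab)

model = CaptionDecoder()
optimizer = torch.optim.Adadelta(model.parameters())   # ADADELTA optimizer (Zeiler 2012)

def train_step(feats, captions, targets):
    logits = model(feats, captions)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_value_(model.parameters(), GRAD_CLIP)  # clip into [-10, 10]
    optimizer.step()
    return loss.item()

The shared-memory shapes appear only as constants here because the paper's specific memory read/write fusion mechanism is not reproduced in this sketch; beam search with width BEAM_SIZE would be applied at decoding time.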