Integrating Both Visual and Audio Cues for Enhanced Video Caption
Authors: Wangli Hao, Zhaoxiang Zhang, He Guan
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments have validated the effectiveness of our three cross-modality fusion strategies on two benchmark datasets: Microsoft Research Video to Text (MSR-VTT) and Microsoft Video Description (MSVD). |
| Researcher Affiliation | Academia | Wangli Hao (1,4), Zhaoxiang Zhang (1,2,3,4), He Guan (1,4). Affiliations: 1) Research Center for Brain-inspired Intelligence, CASIA; 2) National Laboratory of Pattern Recognition, CASIA; 3) CAS Center for Excellence in Brain Science and Intelligence Technology, CAS; 4) University of Chinese Academy of Sciences |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the described methodology (e.g., a repository link, an explicit statement of code release, or mention of code in supplementary materials). |
| Open Datasets | Yes | To validate the performance of our model, we utilize the Microsoft Research-Video to Text Dataset (MSR-VTT) and Microsoft Video Description Dataset (MSVD) (Chen and Dolan 2011). Their split method can be found in (Xu et al. 2016) and (Yao et al. 2015) respectively. |
| Dataset Splits | Yes | Their split method can be found in (Xu et al. 2016) and (Yao et al. 2015) respectively. ... Parameter P is tuned on the validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Parameters are set as follows: beam search size, word embedding dimension, and LSTM hidden state dimension are 5, 468, and 512, respectively. The sizes of the visual-auditory and visual-textual shared memories are 64×128 and 128×512, respectively. To avoid overfitting, dropout (Srivastava et al. 2014) with a rate of 0.5 is applied to both the output of the fully connected layer and the output layers of the LSTM, but not to the intermediate recurrent transitions. In addition, gradients are clipped to the range [-10, 10] to prevent gradient explosion. The optimization algorithm used for the deep feature fusion frameworks is ADADELTA (Zeiler 2012). |
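
For reference, the quoted hyperparameters translate into roughly the following training configuration. This is a minimal PyTorch sketch, not the authors' code: the `VideoCaptioner` class, the vocabulary size, and the placement of dropout are assumptions; only the numeric values (embedding dimension 468, hidden dimension 512, dropout rate 0.5, gradient clip range [-10, 10], ADADELTA) come from the paper. The beam search size of 5 applies at decoding time and is not shown here.

```python
# Minimal sketch of the quoted training setup, assuming a standard
# PyTorch LSTM captioning decoder. VideoCaptioner and vocab_size are
# hypothetical; only the hyperparameter values come from the paper.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Hypothetical captioning decoder illustrating the quoted settings."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 468)       # word embedding dim 468
        self.lstm = nn.LSTM(468, 512, batch_first=True)  # LSTM hidden dim 512
        self.dropout = nn.Dropout(0.5)                   # dropout 0.5 on outputs only,
        self.fc = nn.Linear(512, vocab_size)             # not on recurrent transitions

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.fc(self.dropout(h))

model = VideoCaptioner(vocab_size=10000)  # vocabulary size is an assumption
optimizer = torch.optim.Adadelta(model.parameters())  # ADADELTA (Zeiler 2012)

def train_step(tokens, targets):
    optimizer.zero_grad()
    logits = model(tokens)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    # Clip raw gradient values into [-10, 10] to prevent gradient explosion.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=10.0)
    optimizer.step()
    return loss.item()
```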