Integrating Both Visual and Audio Cues for Enhanced Video Caption

Authors: Wangli Hao, Zhaoxiang Zhang, He Guan

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have validated the effectiveness of our three cross-modalities fusion strategies on two benchmark datasets, including Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD).
Researcher Affiliation | Academia | Wangli Hao (1,4), Zhaoxiang Zhang (1,2,3,4), He Guan (1,4); 1 Research Center for Brain-inspired Intelligence, CASIA; 2 National Laboratory of Pattern Recognition, CASIA; 3 CAS Center for Excellence in Brain Science and Intelligence Technology, CAS; 4 University of Chinese Academy of Sciences
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit statement of code release, or mention of code in supplementary materials) for the source code of the methodology described.
Open Datasets | Yes | To validate the performance of our model, we utilize the Microsoft Research-Video to Text Dataset (MSR-VTT) and Microsoft Video Description Dataset (MSVD) (Chen and Dolan 2011). Their split method can be found in (Xu et al. 2016) and (Yao et al. 2015) respectively.
Dataset Splits | Yes | Their split method can be found in (Xu et al. 2016) and (Yao et al. 2015) respectively. ... Parameter P is tuned on the validation set. (A sketch of these standard splits follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | Parameters are set as follows: beam search size, word embedding dimension and LSTM hidden state dimension are 5, 468 and 512 respectively. Sizes of the visual-auditory and visual-textual shared memories are 64×128 and 128×512 respectively. To avoid overfitting, dropout (Srivastava et al. 2014) with rate 0.5 is utilized on both the output of the fully connected layer and the output layers of the LSTM, but not on the intermediate recurrent transitions. In addition, gradients are clipped into the range [-10, 10] to prevent gradient explosion. The optimization algorithm used for our deep feature fusion frameworks is ADADELTA (Zeiler 2012). (A training-configuration sketch of these settings follows the table.)
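
For convenience, the splits referenced in the Dataset Splits row are sketched below. The split sizes are the ones commonly used for MSR-VTT (Xu et al. 2016) and MSVD (Yao et al. 2015); they are stated here as background knowledge rather than quoted from the paper, and the helper function and variable names are hypothetical.

# Sketch of the standard benchmark splits referenced above (assumed, not quoted
# from the paper): MSR-VTT uses 6,513/497/2,990 clips and MSVD uses
# 1,200/100/670 clips for train/validation/test.
MSRVTT_SPLIT = {"train": 6513, "val": 497, "test": 2990}   # 10,000 clips total
MSVD_SPLIT = {"train": 1200, "val": 100, "test": 670}      # 1,970 clips total

def split_by_index(video_ids, sizes):
    """Partition an ordered list of video ids into contiguous train/val/test chunks."""
    train_end = sizes["train"]
    val_end = train_end + sizes["val"]
    return {
        "train": video_ids[:train_end],
        "val": video_ids[train_end:val_end],
        "test": video_ids[val_end:],
    }

# Example usage with a hypothetical ordered id list:
# msvd_splits = split_by_index(sorted_msvd_ids, MSVD_SPLIT)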
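
The Experiment Setup row can also be made concrete with a minimal PyTorch-style sketch. Only the hyperparameter values (beam width 5, word embedding 468, LSTM hidden size 512, 64×128 and 128×512 shared memories, dropout 0.5, gradient clipping to [-10, 10], ADADELTA) come from the paper; the module and function names, the feature dimension, and the vocabulary size are hypothetical placeholders, and this is not the authors' implementation of their fusion frameworks.

import torch
import torch.nn as nn

# Hyperparameter values reported in the paper.
BEAM_SIZE = 5                  # beam search width at decoding time
WORD_EMBED_DIM = 468           # word embedding dimension
HIDDEN_DIM = 512               # LSTM hidden state dimension
VIS_AUD_MEMORY = (64, 128)     # visual-auditory shared memory size
VIS_TXT_MEMORY = (128, 512)    # visual-textual shared memory size
DROPOUT_RATE = 0.5
GRAD_CLIP = 10.0               # gradient values clipped into [-10, 10]

class CaptionDecoder(nn.Module):
    """Hypothetical decoder skeleton that only mirrors the reported dimensions."""
    def __init__(self, vocab_size=10000, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, WORD_EMBED_DIM)
        self.fc = nn.Linear(feat_dim, HIDDEN_DIM)
        self.fc_dropout = nn.Dropout(DROPOUT_RATE)    # dropout on the FC output
        self.lstm = nn.LSTM(WORD_EMBED_DIM + HIDDEN_DIM, HIDDEN_DIM, batch_first=True)
        self.out_dropout = nn.Dropout(DROPOUT_RATE)   # dropout on LSTM outputs only,
                                                      # not on recurrent transitions
        self.classifier = nn.Linear(HIDDEN_DIM, vocab_size)

    def forward(self, feats, captions):
        ctx = self.fc_dropout(self.fc(feats))                     # (B, H) fused video feature
        ctx = ctx.unsqueeze(1).expand(-1, captions.size(1), -1)   # repeat over time steps
        words = self.embed(captions)                              # (B, T, E)
        hidden, _ = self.lstm(torch.cat([words, ctx], dim=-1))
        return self.classifier(self.out_dropout(hidden))          # (B, T, vocab)

model = CaptionDecoder()
optimizer = torch.optim.Adadelta(model.parameters())   # ADADELTA optimizer (Zeiler 2012)

def train_step(feats, captions, targets):
    logits = model(feats, captions)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_value_(model.parameters(), GRAD_CLIP)  # clip into [-10, 10]
    optimizer.step()
    return loss.item()

The shared-memory shapes appear only as constants here because the paper's specific memory read/write fusion mechanism is not reproduced in this sketch; beam search with width BEAM_SIZE would be applied at decoding time.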