Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
Authors: Xin Wang, Jiawei Wu, Da Zhang, Yu Su, William Yang Wang
AAAI 2019, pp. 8965-8972
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirical results not only validate the effectiveness of our method in utilizing semantic knowledge for video captioning, but also show its strong generalization ability when describing novel activities." The paper also includes the sections Experimental Setup, Evaluation Metrics, Implementation Details, and Experiments and Analysis. |
| Researcher Affiliation | Academia | Xin Wang (1), Jiawei Wu (1), Da Zhang (1), Yu Su (2), William Yang Wang (1); (1) University of California, Santa Barbara; (2) The Ohio State University |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of its source code. |
| Open Datasets | Yes | "We set up the zero-shot learning scenario based on the ActivityNet Captions dataset." "We use four popular and diverse metrics for language generation: CIDEr, BLEU, METEOR, and ROUGE-L." "We further test it on the widely-used MSR-VTT dataset (Xu et al. 2016)." The paper also notes: "ActivityNet (Fabian Caba Heilbron and Niebles 2015)... Recently, (Krishna et al. 2017) have collected the corresponding natural language descriptions for the videos in the ActivityNet dataset, leading to the ActivityNet Captions dataset." |
| Dataset Splits | Yes | We re-split the videos of the 200 activities into the training set (170 activities), the validation set (15 activities), and the unseen test set (15 activities). |
| Hardware Specification | Yes | It takes around 6 hours to fully train a model on a TITAN X. |
| Software Dependencies | No | The paper mentions tools and techniques like 'pretrained fasttext embeddings', 'Adadelta optimizer', and 'variational dropout', but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The maximum number of video features is 200 and the maximum caption length is 32. The video encoder is a bi-LSTM of size 512, and the decoder LSTM is of size 1024. We initialize all the parameters from a uniform distribution on [-0.1, 0.1]. The Adadelta optimizer (Zeiler 2012) is used with batch size 64. The learning rate starts at 1 and is then halved when the current CIDEr score does not surpass the previous best within 4 epochs. The maximum number of epochs is 100, and we shuffle the training data at each epoch. Scheduled sampling (Bengio et al. 2015) is also employed to train the models. Beam search of size 5 is used at test time. A hedged sketch of this configuration appears below the table. |
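
Since the authors do not release code, the following is a minimal sketch of the reported training configuration, assuming PyTorch. Class and function names (`VideoEncoder`, `CaptionDecoder`, `init_uniform`, `maybe_halve_lr`) are hypothetical stand-ins, and the paper's topic-aware mixture-of-experts decoder, attention mechanism, and scheduled sampling are omitted for brevity.

```python
# Hypothetical sketch of the reported setup: bi-LSTM encoder (512),
# LSTM decoder (1024), uniform init on [-0.1, 0.1], Adadelta with
# batch size 64, lr halved when CIDEr stalls for 4 epochs.
import torch
import torch.nn as nn

MAX_VIDEO_FEATS = 200   # maximum number of video features per clip
MAX_CAPTION_LEN = 32    # maximum caption length in tokens
BATCH_SIZE = 64
MAX_EPOCHS = 100
BEAM_SIZE = 5           # beam search width at test time

class VideoEncoder(nn.Module):
    """Bidirectional LSTM video encoder with hidden size 512."""
    def __init__(self, feat_dim, hidden_size=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, feats):
        outputs, _ = self.lstm(feats)  # (batch, T, 2 * hidden_size)
        return outputs

class CaptionDecoder(nn.Module):
    """LSTM caption decoder with hidden size 1024 (attention omitted)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_size=1024, ctx_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + ctx_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, context):
        emb = self.embed(tokens)                            # (batch, L, embed_dim)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(hidden)                             # (batch, L, vocab_size)

def init_uniform(model, bound=0.1):
    """Initialize all parameters from U[-0.1, 0.1], as stated in the paper."""
    for p in model.parameters():
        nn.init.uniform_(p, -bound, bound)

def make_optimizer(params):
    """Adadelta with the learning rate starting at 1."""
    return torch.optim.Adadelta(params, lr=1.0)

def maybe_halve_lr(optimizer, cider_history, patience=4):
    """Halve the learning rate when CIDEr has not surpassed the best in `patience` epochs."""
    if len(cider_history) > patience and \
            max(cider_history[-patience:]) <= max(cider_history[:-patience]):
        for group in optimizer.param_groups:
            group["lr"] /= 2.0
```

This only mirrors the hyperparameters quoted above; reproducing the paper's results would additionally require the topic-aware expert composition, the video feature extraction pipeline, and the ActivityNet Captions re-split by activity.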