Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
Authors: Xin Wang, Jiawei Wu, Da Zhang, Yu Su, William Yang Wang
AAAI 2019, pp. 8965-8972
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirical results not only validate the effectiveness of our method in utilizing semantic knowledge for video captioning, but also show its strong generalization ability when describing novel activities." The paper also includes the sections Experimental Setup, Evaluation Metrics, Implementation Details, and Experiments and Analysis. |
| Researcher Affiliation | Academia | Xin Wang (1), Jiawei Wu (1), Da Zhang (1), Yu Su (2), William Yang Wang (1); (1) University of California, Santa Barbara; (2) The Ohio State University |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of its source code. |
| Open Datasets | Yes | "We set up the zero-shot learning scenario based on the ActivityNet Captions dataset." "We use four popular and diverse metrics for language generation: CIDEr, BLEU, METEOR, and ROUGE-L." "We further test it on the widely-used MSR-VTT dataset (Xu et al. 2016)." The paper also notes: "ActivityNet (Fabian Caba Heilbron and Niebles 2015)... Recently, (Krishna et al. 2017) have collected the corresponding natural language descriptions for the videos in the ActivityNet dataset, leading to the ActivityNet Captions dataset." |
| Dataset Splits | Yes | We re-split the videos of the 200 activities into the training set (170 activities), the validation set (15 activities), and the unseen test set (15 activities). |
| Hardware Specification | Yes | It takes around 6 hours to fully train a model on a TITAN X. |
| Software Dependencies | No | The paper mentions tools and techniques like 'pretrained fasttext embeddings', 'Adadelta optimizer', and 'variational dropout', but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | The maximum number of video features is 200 and the maximum caption length is 32. The video encoder is a bi-LSTM of size 512, and the decoder LSTM is of size 1024. We initialize all the parameters from a uniform distribution on [-0.1, 0.1]. The Adadelta optimizer (Zeiler 2012) is used with batch size 64. The learning rate starts at 1 and is then halved when the current CIDEr score does not surpass the previous best within 4 epochs. The maximum number of epochs is 100, and we shuffle the training data at each epoch. Scheduled sampling (Bengio et al. 2015) is also employed to train the models. Beam search of size 5 is used at test time. A hedged sketch of this configuration appears below the table. |
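
Since the authors do not release code, the following is a minimal sketch of the reported training configuration, assuming PyTorch. Class and function names (`VideoEncoder`, `CaptionDecoder`, `init_uniform`, `maybe_halve_lr`) are hypothetical stand-ins, and the paper's topic-aware mixture-of-experts decoder, attention mechanism, and scheduled sampling are omitted for brevity.

```python
# Hypothetical sketch of the reported setup: bi-LSTM encoder (512),
# LSTM decoder (1024), uniform init on [-0.1, 0.1], Adadelta with
# batch size 64, lr halved when CIDEr stalls for 4 epochs.
import torch
import torch.nn as nn

MAX_VIDEO_FEATS = 200   # maximum number of video features per clip
MAX_CAPTION_LEN = 32    # maximum caption length in tokens
BATCH_SIZE = 64
MAX_EPOCHS = 100
BEAM_SIZE = 5           # beam search width at test time

class VideoEncoder(nn.Module):
    """Bidirectional LSTM video encoder with hidden size 512."""
    def __init__(self, feat_dim, hidden_size=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, feats):
        outputs, _ = self.lstm(feats)  # (batch, T, 2 * hidden_size)
        return outputs

class CaptionDecoder(nn.Module):
    """LSTM caption decoder with hidden size 1024 (attention omitted)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_size=1024, ctx_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + ctx_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, context):
        emb = self.embed(tokens)                            # (batch, L, embed_dim)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(hidden)                             # (batch, L, vocab_size)

def init_uniform(model, bound=0.1):
    """Initialize all parameters from U[-0.1, 0.1], as stated in the paper."""
    for p in model.parameters():
        nn.init.uniform_(p, -bound, bound)

def make_optimizer(params):
    """Adadelta with the learning rate starting at 1."""
    return torch.optim.Adadelta(params, lr=1.0)

def maybe_halve_lr(optimizer, cider_history, patience=4):
    """Halve the learning rate when CIDEr has not surpassed the best in `patience` epochs."""
    if len(cider_history) > patience and \
            max(cider_history[-patience:]) <= max(cider_history[:-patience]):
        for group in optimizer.param_groups:
            group["lr"] /= 2.0
```

This only mirrors the hyperparameters quoted above; reproducing the paper's results would additionally require the topic-aware expert composition, the video feature extraction pipeline, and the ActivityNet Captions re-split by activity.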