Video Interactive Captioning with Human Prompts

Authors: Aming Wu, Yahong Han, Yi Yang

IJCAI 2019

Reproducibility variables (each entry gives the assessed result, followed by the supporting LLM response):
Research Type: Experimental. Experimental results not only show that the prompts can help generate more accurate captions, but also demonstrate the good performance of the proposed method.
Researcher Affiliation: Collaboration. Aming Wu (1), Yahong Han (1), and Yi Yang (2,3). (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) School of Computer Science, University of Technology Sydney, Australia; (3) Baidu Research. Contact: {tjwam, yahong}@tju.edu.cn, yi.yang@uts.edu.au.
Pseudocode: No. The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code: Yes. Code is publicly available on GitHub: https://github.com/ViCap01/ViCap.
Open Datasets: Yes. MSRVTT-2016 [Xu et al., 2016] is the recently released largest dataset for video captioning.
Dataset Splits: Yes. For ViCap models, we take the 5001st to 8500th clips as the training set, the 8501st to 9000th clips as the validation set, and the 9001st to 10000th clips as the test set (see the split sketch after this list).
Hardware Specification: No. The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies: No. The paper mentions using the Adam optimizer and pre-trained models such as S2VT and HRNE, but does not provide version numbers for any software dependencies.
Experiment Setup: Yes. In the experiments, 20 equally-spaced frames are selected from each video and fed into GoogLeNet [Szegedy et al.] to extract a 1,024-dimensional frame-wise representation. For the encoding networks of both the video and the initial caption, the number of output channels is set to 512. For CNN-D, the per-layer output channels are 512, 256, 256, 512, and 512; for CNN-R, they are 512, 256, and 512. For IGRU-D and GRU-R, the number of output channels is set to 512. During training, the Adam optimizer is used with an initial learning rate of 1 × 10^-3. λ1 and λ5 are set to 0.4 and 0.6, respectively; β1, β2, and λ are set to 0.6, 0.4, and 0.001, respectively. (A runnable configuration sketch follows this list.)
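
The dataset split above can be reproduced from clip indices alone. Below is a minimal Python sketch encoding the quoted ranges; the function name vicap_splits and the assumption that MSR-VTT clips are addressed by their 1-based ordinal position are ours, not the paper's.

```python
# Minimal sketch of the ViCap data partition described above.
# Assumption: clips are addressed by their 1-based ordinal position
# in MSR-VTT (the paper only states the ordinal ranges).

def vicap_splits():
    """Return clip-index lists for the ViCap train/val/test partition."""
    train = list(range(5001, 8501))   # 5001st-8500th clips (3,500)
    val = list(range(8501, 9001))     # 8501st-9000th clips (500)
    test = list(range(9001, 10001))   # 9001st-10000th clips (1,000)
    return train, val, test

if __name__ == "__main__":
    train, val, test = vicap_splits()
    print(len(train), len(val), len(test))  # -> 3500 500 1000
```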
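
The experiment-setup entry bundles several concrete hyperparameters. The sketch below collects them in one place and shows one common way to pick equally-spaced frame indices with numpy.linspace; the constant names, the dictionary layout, and the sampling helper are illustrative assumptions rather than the authors' released code.

```python
# Hypothetical configuration sketch assembled from the quoted setup;
# not the authors' code. GoogLeNet feature extraction itself is omitted.
import numpy as np

NUM_FRAMES = 20        # equally-spaced frames sampled per video
FEATURE_DIM = 1024     # GoogLeNet frame-wise representation size
LEARNING_RATE = 1e-3   # initial learning rate for the Adam optimizer

# Per-module output channels, as stated in the paper.
OUT_CHANNELS = {
    "video_encoder": 512,
    "caption_encoder": 512,
    "CNN-D": [512, 256, 256, 512, 512],
    "CNN-R": [512, 256, 512],
    "IGRU-D": 512,
    "GRU-R": 512,
}

# Loss weights quoted in the setup.
LAMBDA_1, LAMBDA_5 = 0.4, 0.6
BETA_1, BETA_2, LAMBDA_REG = 0.6, 0.4, 0.001

def sample_frame_indices(total_frames, k=NUM_FRAMES):
    """Return k equally-spaced frame indices for a clip of total_frames frames."""
    return np.linspace(0, total_frames - 1, num=k).round().astype(int)

if __name__ == "__main__":
    # A 300-frame clip yields indices 0, 16, 31, ..., 299.
    print(sample_frame_indices(300))
```

Note that BETA_1 and BETA_2 here are the paper's loss weights β1 and β2, not Adam's moment-decay parameters, which the paper does not specify.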