Video Interactive Captioning with Human Prompts

Authors: Aming Wu, Yahong Han, Yi Yang

IJCAI 2019

Reproducibility variables (each entry gives the assessed result, followed by the supporting LLM response):
Research Type: Experimental. Experimental results not only show that the prompts can help generate more accurate captions, but also demonstrate the good performance of the proposed method.
Researcher Affiliation: Collaboration. Aming Wu (1), Yahong Han (1), and Yi Yang (2,3). (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) School of Computer Science, University of Technology Sydney, Australia; (3) Baidu Research. Contact: {tjwam, yahong}@tju.edu.cn, yi.yang@uts.edu.au.
Pseudocode: No. The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code: Yes. Code is publicly available on GitHub: https://github.com/ViCap01/ViCap.
Open Datasets: Yes. MSRVTT-2016 [Xu et al., 2016] is the recently released largest dataset for video captioning.
Dataset Splits: Yes. For ViCap models, we take the 5001st to 8500th clips as the training set, the 8501st to 9000th clips as the validation set, and the 9001st to 10000th clips as the test set (see the split sketch after this list).
Hardware Specification: No. The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies: No. The paper mentions using the Adam optimizer and pre-trained models such as S2VT and HRNE, but does not provide version numbers for any software dependencies.
Experiment Setup: Yes. In the experiments, 20 equally-spaced frames are selected from each video and fed into GoogLeNet [Szegedy et al.] to extract a 1,024-dimensional frame-wise representation. For the encoding networks of both the video and the initial caption, the number of output channels is set to 512. For CNN-D, the per-layer output channels are 512, 256, 256, 512, and 512; for CNN-R, they are 512, 256, and 512. For IGRU-D and GRU-R, the number of output channels is set to 512. During training, the Adam optimizer is used with an initial learning rate of 1 × 10^-3. λ1 and λ5 are set to 0.4 and 0.6, respectively; β1, β2, and λ are set to 0.6, 0.4, and 0.001, respectively. (A runnable configuration sketch follows this list.)
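
The dataset split above can be reproduced from clip indices alone. Below is a minimal Python sketch encoding the quoted ranges; the function name vicap_splits and the assumption that MSR-VTT clips are addressed by their 1-based ordinal position are ours, not the paper's.

```python
# Minimal sketch of the ViCap data partition described above.
# Assumption: clips are addressed by their 1-based ordinal position
# in MSR-VTT (the paper only states the ordinal ranges).

def vicap_splits():
    """Return clip-index lists for the ViCap train/val/test partition."""
    train = list(range(5001, 8501))   # 5001st-8500th clips (3,500)
    val = list(range(8501, 9001))     # 8501st-9000th clips (500)
    test = list(range(9001, 10001))   # 9001st-10000th clips (1,000)
    return train, val, test

if __name__ == "__main__":
    train, val, test = vicap_splits()
    print(len(train), len(val), len(test))  # -> 3500 500 1000
```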
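
The experiment-setup entry bundles several concrete hyperparameters. The sketch below collects them in one place and shows one common way to pick equally-spaced frame indices with numpy.linspace; the constant names, the dictionary layout, and the sampling helper are illustrative assumptions rather than the authors' released code.

```python
# Hypothetical configuration sketch assembled from the quoted setup;
# not the authors' code. GoogLeNet feature extraction itself is omitted.
import numpy as np

NUM_FRAMES = 20        # equally-spaced frames sampled per video
FEATURE_DIM = 1024     # GoogLeNet frame-wise representation size
LEARNING_RATE = 1e-3   # initial learning rate for the Adam optimizer

# Per-module output channels, as stated in the paper.
OUT_CHANNELS = {
    "video_encoder": 512,
    "caption_encoder": 512,
    "CNN-D": [512, 256, 256, 512, 512],
    "CNN-R": [512, 256, 512],
    "IGRU-D": 512,
    "GRU-R": 512,
}

# Loss weights quoted in the setup.
LAMBDA_1, LAMBDA_5 = 0.4, 0.6
BETA_1, BETA_2, LAMBDA_REG = 0.6, 0.4, 0.001

def sample_frame_indices(total_frames, k=NUM_FRAMES):
    """Return k equally-spaced frame indices for a clip of total_frames frames."""
    return np.linspace(0, total_frames - 1, num=k).round().astype(int)

if __name__ == "__main__":
    # A 300-frame clip yields indices 0, 16, 31, ..., 299.
    print(sample_frame_indices(300))
```

Note that BETA_1 and BETA_2 here are the paper's loss weights β1 and β2, not Adam's moment-decay parameters, which the paper does not specify.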