Video Interactive Captioning with Human Prompts
Authors: Aming Wu, Yahong Han, Yi Yang
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results not only show the prompts can help generate more accurate captions, but also demonstrate the good performance of the proposed method. |
| Researcher Affiliation | Collaboration | Aming Wu¹, Yahong Han¹, and Yi Yang²,³. ¹College of Intelligence and Computing, Tianjin University, Tianjin, China; ²School of Computer Science, University of Technology Sydney, Australia; ³Baidu Research. {tjwam, yahong}@tju.edu.cn, yi.yang@uts.edu.au |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code is publicly available on GitHub: https://github.com/ViCap01/ViCap. |
| Open Datasets | Yes | MSRVTT-2016 [Xu et al., 2016] is the recently released largest dataset for video captioning. |
| Dataset Splits | Yes | For ViCap models, we take the 5001st to 8500th clip as the training set, the 8501st to 9000th clip as the validation set, and the 9001st to 10000th clip as the test set (see the split sketch after the table). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and pre-trained models like 'S2VT' and 'HRNE', but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In the following experiments, we select 20 equally-spaced frames from each video and feed them into GoogLeNet [Szegedy et al., 2015] to extract a 1,024-dimensional frame-wise representation. For the encoding networks of both the video and the initial caption, the number of output channels is set to 512. For CNN-D, the numbers of output channels of the layers are set to 512, 256, 256, 512, and 512, respectively. For CNN-R, the numbers of output channels are set to 512, 256, and 512. For IGRU-D and GRU-R, the number of output channels is set to 512. Finally, during training, we use the Adam optimizer with an initial learning rate of 1×10⁻³. λ₁ and λ₅ are set to 0.4 and 0.6, respectively; β₁, β₂, and λ are set to 0.6, 0.4, and 0.001 (see the sampling and configuration sketches after the table). |
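
For concreteness, the clip-index split quoted in the Dataset Splits row can be expressed as a minimal sketch. The index ranges are taken verbatim from the paper; the function and variable names are our own, not from the authors' code:

```python
# Minimal sketch of the reported MSR-VTT clip-index split for ViCap models.
# Ranges are quoted from the paper (1-indexed clips); names are illustrative.
def split_msrvtt(clip_ids):
    train = [c for c in clip_ids if 5001 <= c <= 8500]   # 3,500 training clips
    val   = [c for c in clip_ids if 8501 <= c <= 9000]   # 500 validation clips
    test  = [c for c in clip_ids if 9001 <= c <= 10000]  # 1,000 test clips
    return train, val, test

train, val, test = split_msrvtt(range(1, 10001))
assert (len(train), len(val), len(test)) == (3500, 500, 1000)
```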
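
The setup samples 20 equally-spaced frames per video before GoogLeNet feature extraction. A hedged sketch of that sampling step, assuming a decoded frame list and nearest-index rounding (the rounding scheme is our assumption, not stated in the paper):

```python
import numpy as np

# Sketch of selecting 20 equally-spaced frames from a decoded video,
# as described in the experiment setup; rounding to the nearest index
# is our assumption.
def sample_frames(frames, num_frames=20):
    idx = np.linspace(0, len(frames) - 1, num=num_frames)
    return [frames[int(round(i))] for i in idx]
```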
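
The remaining setup values can be collected into a single configuration sketch. All numbers are taken verbatim from the Experiment Setup row; the key names are illustrative and do not come from the paper or its repository:

```python
# Hyperparameters reported in the Experiment Setup row, gathered into one
# config dict. Values are verbatim from the paper; key names are our own.
config = {
    "num_frames": 20,                 # equally-spaced frames per video
    "frame_feature_dim": 1024,        # GoogLeNet frame-wise representation
    "encoder_channels": 512,          # video and initial-caption encoders
    "cnn_d_channels": [512, 256, 256, 512, 512],
    "cnn_r_channels": [512, 256, 512],
    "igru_d_channels": 512,
    "gru_r_channels": 512,
    "optimizer": "adam",
    "learning_rate": 1e-3,            # initial learning rate
    "lambda_1": 0.4,
    "lambda_5": 0.6,
    "beta_1": 0.6,
    "beta_2": 0.4,
    "lambda": 0.001,
}
```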