Augmented Partial Mutual Learning with Frame Masking for Video Captioning
Authors: Ke Lin, Zhuoxin Gan, Liwei Wang
AAAI 2021, pages 2047-2055
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments performed on the MSR-VTT and MSVD datasets demonstrate that our proposed algorithm achieves state-of-the-art performance. |
| Researcher Affiliation | Collaboration | Ke Lin (1,2), Zhuoxin Gan (2), Liwei Wang (1); (1) Peking University, China; (2) Samsung Research China Beijing (SRC-B), China; {ke17.lin, zhuoxin1.gan}@samsung.com, wanglw@pku.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references a BERT model from 'https://github.com/huggingface/transformers' which is a third-party tool, but does not provide specific access to its own source code for the methodology described. |
| Open Datasets | Yes | We conduct experiments on two benchmark datasets: the Microsoft Research Video Description Corpus (MSVD) and Microsoft Research Video to Text (MSR-VTT). MSVD contains 1970 YouTube short video clips of 10 to 25 seconds, each depicting a single activity, with about 40 English descriptions per clip. We use the public splits, which take 1200 video clips for training, 100 clips for validation and 670 clips for testing. For MSR-VTT, we use the initial version, referred to as MSR-VTT-10K, which has 10k video clips, each with 20 descriptions annotated by 1327 workers from Amazon Mechanical Turk. MSR-VTT has 200k video-caption pairs and 29316 unique words. (A split-loading sketch follows the table.) |
| Dataset Splits | Yes | We use the public splits which take 1200 video clips for training, 100 clips for validation and 670 clips for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'BERT (Delvin et al. 2018) model' and 'https://github.com/huggingface/transformers' but does not specify version numbers for Python, PyTorch, or any other libraries or frameworks used in their implementation. |
| Experiment Setup | Yes | We uniformly sample N = 16 frames for each video. We select the top 10 proposals with the highest output probabilities for each frame. We pre-train our scene graph construction model with an Adam optimizer and a learning rate of 5e-4. The batch size is 64, the dropout rate is 0.3, and the word embedding dimension is e = 512. For the GRU and LSTM decoders, the model size and all hidden sizes are 512. For the transformer decoder, the number of layers is 6, the number of heads is 8 and the model dimension is 512. We train the captioning model using an Adam optimizer. We set the hyper-parameters as Q = 300, K = 10, λ1 = λ2 = 1e-3, λ3 = 1. (A hedged configuration sketch based on these values follows the table.) |
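
The public MSVD split quoted in the Open Datasets and Dataset Splits rows is a fixed 1200/100/670 partition of the 1970 clips. A minimal sketch of that partition is below; the placeholder clip IDs and the assumption that the IDs arrive in the canonical order are hypothetical, since the paper only states the split sizes.

```python
# Minimal sketch of the public MSVD split (1200 train / 100 val / 670 test).
# The ID format and ordering are placeholders; only the split sizes come from the paper.
from typing import Dict, List


def load_msvd_splits(video_ids: List[str]) -> Dict[str, List[str]]:
    """Partition the 1970 MSVD clips into the public 1200/100/670 split."""
    assert len(video_ids) == 1970, "MSVD contains 1970 clips in total"
    return {
        "train": video_ids[:1200],
        "val": video_ids[1200:1300],
        "test": video_ids[1300:],  # remaining 670 clips
    }


if __name__ == "__main__":
    ids = [f"vid{i:04d}" for i in range(1970)]  # hypothetical clip names
    splits = load_msvd_splits(ids)
    print({k: len(v) for k, v in splits.items()})  # {'train': 1200, 'val': 100, 'test': 670}
```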
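
The Experiment Setup row lists the reported hyper-parameters, but no reference implementation is released, so the sketch below only collects those values and wires them into a standard PyTorch transformer decoder. The `CONFIG` dict, the `build_transformer_decoder` helper, and the reuse of the 5e-4 learning rate (reported for scene-graph pre-training) for the decoder optimizer are assumptions, not the authors' code.

```python
# A hedged sketch that gathers the paper's reported hyper-parameters
# and instantiates a generic transformer decoder with them (PyTorch >= 1.9).
import torch
import torch.nn as nn

CONFIG = dict(
    num_frames=16,           # N = 16 uniformly sampled frames per video
    proposals_per_frame=10,  # top-10 region proposals kept per frame
    batch_size=64,
    dropout=0.3,
    embed_dim=512,           # word embedding dimension e
    rnn_hidden=512,          # GRU / LSTM decoder hidden size
    tf_layers=6,             # transformer decoder layers
    tf_heads=8,              # attention heads
    tf_dim=512,              # transformer model dimension
    Q=300, K=10,
    lambda1=1e-3, lambda2=1e-3, lambda3=1.0,
    lr=5e-4,                 # assumption: reported for scene-graph pre-training, reused here
)


def build_transformer_decoder(cfg=CONFIG) -> nn.TransformerDecoder:
    layer = nn.TransformerDecoderLayer(
        d_model=cfg["tf_dim"],
        nhead=cfg["tf_heads"],
        dropout=cfg["dropout"],
        batch_first=True,
    )
    return nn.TransformerDecoder(layer, num_layers=cfg["tf_layers"])


if __name__ == "__main__":
    decoder = build_transformer_decoder()
    optimizer = torch.optim.Adam(decoder.parameters(), lr=CONFIG["lr"])
    # 16 frames x 10 proposals = 160 visual tokens per video (dummy features).
    memory = torch.randn(2, CONFIG["num_frames"] * CONFIG["proposals_per_frame"], CONFIG["tf_dim"])
    tgt = torch.randn(2, 20, CONFIG["tf_dim"])  # 20 already-embedded caption tokens
    out = decoder(tgt, memory)                  # -> torch.Size([2, 20, 512])
```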