Augmented Partial Mutual Learning with Frame Masking for Video Captioning
Authors: Ke Lin, Zhuoxin Gan, Liwei Wang
AAAI 2021, pages 2047-2055
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments performed on the MSR-VTT and MSVD datasets demonstrate that our proposed algorithm achieves state-of-the-art performance. |
| Researcher Affiliation | Collaboration | Ke Lin (1,2), Zhuoxin Gan (2), Liwei Wang (1); (1) Peking University, China; (2) Samsung Research China Beijing (SRC-B), China; {ke17.lin, zhuoxin1.gan}@samsung.com, wanglw@pku.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references a BERT model from 'https://github.com/huggingface/transformers' which is a third-party tool, but does not provide specific access to its own source code for the methodology described. |
| Open Datasets | Yes | We conduct experiments on two benchmark datasets: the Microsoft Research Video Description Corpus (MSVD) and Microsoft Research Video to Text (MSR-VTT). MSVD contains 1970 YouTube short video clips of 10 to 25 seconds, each depicting a single activity, with about 40 English descriptions per clip. We use the public splits, which take 1200 video clips for training, 100 clips for validation and 670 clips for testing. For MSR-VTT, we use the initial version, referred to as MSR-VTT-10K, which has 10k video clips, each with 20 descriptions annotated by 1327 workers from Amazon Mechanical Turk. MSR-VTT has 200k video-caption pairs and 29316 unique words. (A split-loading sketch follows the table.) |
| Dataset Splits | Yes | We use the public splits which take 1200 video clips for training, 100 clips for validation and 670 clips for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'BERT (Delvin et al. 2018) model' and 'https://github.com/huggingface/transformers' but does not specify version numbers for Python, PyTorch, or any other libraries or frameworks used in their implementation. |
| Experiment Setup | Yes | We uniformly sample N = 16 frames for each video. We select the top 10 proposals with the highest output probabilities for each frame. We pre-train our scene graph construction model with an Adam optimizer and a learning rate of 5e-4. The batch size is 64, the dropout rate is 0.3, and the word embedding dimension is e = 512. For the GRU and LSTM decoders, the model size and all hidden sizes are 512. For the transformer decoder, the number of layers is 6, the number of heads is 8 and the model dimension is 512. We train the captioning model using an Adam optimizer. We set the hyper-parameters as Q = 300, K = 10, λ1 = λ2 = 1e-3, λ3 = 1. (A hedged configuration sketch based on these values follows the table.) |
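
The public MSVD split quoted in the Open Datasets and Dataset Splits rows is a fixed 1200/100/670 partition of the 1970 clips. A minimal sketch of that partition is below; the placeholder clip IDs and the assumption that the IDs arrive in the canonical order are hypothetical, since the paper only states the split sizes.

```python
# Minimal sketch of the public MSVD split (1200 train / 100 val / 670 test).
# The ID format and ordering are placeholders; only the split sizes come from the paper.
from typing import Dict, List


def load_msvd_splits(video_ids: List[str]) -> Dict[str, List[str]]:
    """Partition the 1970 MSVD clips into the public 1200/100/670 split."""
    assert len(video_ids) == 1970, "MSVD contains 1970 clips in total"
    return {
        "train": video_ids[:1200],
        "val": video_ids[1200:1300],
        "test": video_ids[1300:],  # remaining 670 clips
    }


if __name__ == "__main__":
    ids = [f"vid{i:04d}" for i in range(1970)]  # hypothetical clip names
    splits = load_msvd_splits(ids)
    print({k: len(v) for k, v in splits.items()})  # {'train': 1200, 'val': 100, 'test': 670}
```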
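
The Experiment Setup row lists the reported hyper-parameters, but no reference implementation is released, so the sketch below only collects those values and wires them into a standard PyTorch transformer decoder. The `CONFIG` dict, the `build_transformer_decoder` helper, and the reuse of the 5e-4 learning rate (reported for scene-graph pre-training) for the decoder optimizer are assumptions, not the authors' code.

```python
# A hedged sketch that gathers the paper's reported hyper-parameters
# and instantiates a generic transformer decoder with them (PyTorch >= 1.9).
import torch
import torch.nn as nn

CONFIG = dict(
    num_frames=16,           # N = 16 uniformly sampled frames per video
    proposals_per_frame=10,  # top-10 region proposals kept per frame
    batch_size=64,
    dropout=0.3,
    embed_dim=512,           # word embedding dimension e
    rnn_hidden=512,          # GRU / LSTM decoder hidden size
    tf_layers=6,             # transformer decoder layers
    tf_heads=8,              # attention heads
    tf_dim=512,              # transformer model dimension
    Q=300, K=10,
    lambda1=1e-3, lambda2=1e-3, lambda3=1.0,
    lr=5e-4,                 # assumption: reported for scene-graph pre-training, reused here
)


def build_transformer_decoder(cfg=CONFIG) -> nn.TransformerDecoder:
    layer = nn.TransformerDecoderLayer(
        d_model=cfg["tf_dim"],
        nhead=cfg["tf_heads"],
        dropout=cfg["dropout"],
        batch_first=True,
    )
    return nn.TransformerDecoder(layer, num_layers=cfg["tf_layers"])


if __name__ == "__main__":
    decoder = build_transformer_decoder()
    optimizer = torch.optim.Adam(decoder.parameters(), lr=CONFIG["lr"])
    # 16 frames x 10 proposals = 160 visual tokens per video (dummy features).
    memory = torch.randn(2, CONFIG["num_frames"] * CONFIG["proposals_per_frame"], CONFIG["tf_dim"])
    tgt = torch.randn(2, 20, CONFIG["tf_dim"])  # 20 already-embedded caption tokens
    out = decoder(tgt, memory)                  # -> torch.Size([2, 20, 512])
```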