Learnability Matters: Active Learning for Video Captioning

Authors: Yiqian Zhang, Buyu Liu, Jun Bao, Qiang Huang, Min Zhang, Jun Yu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Results on publicly available video captioning datasets with diverse video captioning models demonstrate that our algorithm outperforms SOTA active learning methods by a large margin, e.g., we achieve about 103% of full performance on CIDEr with 25% of human annotations on MSR-VTT."
Researcher Affiliation | Collaboration | 1 Hangzhou Dianzi University, 2 Harbin Institute of Technology (Shenzhen), 3 National University of Singapore; yiqian.zyq@gmail.com, {buyu.liu, baojun}@hit.edu.cn, huangq@comp.nus.edu.sg, zhangmin2021@hit.edu.cn, zju.yujun@gmail.com
Pseudocode | Yes | "Our full caption-wise algorithm is summarized in Alg. 1."
Open Source Code | Yes | "Our code and model will be made available."
Open Datasets | Yes | "We conduct our experiments on two datasets, MSVD (Chen and Dolan, 2011a) and MSR-VTT (Xu et al., 2016)."
Dataset Splits | Yes | "For each dataset, we follow their standard splits and report our active learning performance on their test sets. To mimic the learning process, we initialize L with 5% of data randomly selected from the training set, including both videos and their annotations."
Hardware Specification | Yes | "All experiments are conducted on 4 RTX 3090 GPUs and 4 RTX 4090 GPUs."
Software Dependencies | Yes | "With PyTorch (Paszke et al., 2019) and Huggingface transformers (Wolf et al., 2020)."
Experiment Setup | Yes | "The hyper-parameters λ1, λ2, λ3 in Eq. 5 are 3, 1, and 2, respectively, chosen via grid search on the validation set. Our re-ranking factor q in Eq. 6 is set to 10 on MSR-VTT. Meanwhile, R is set to 3, dividing the ranked videos into 3 regions according to L̂_n. Specifically, both the first and last regions consist of 2000 videos. A_{t,r} equals 2, 1, and 0 with r = {1, 2, 3}."
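The region-based allocation described in the experiment setup (rank videos by L̂_n, split into R = 3 regions with the first and last holding 2000 videos each, and assign A_{t,r} = 2, 1, 0 annotations per region) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, its arguments, and the assumption that higher-scored videos receive more annotations are all our own.

```python
def allocate_annotations(scores, first=2000, last=2000, per_region=(2, 1, 0)):
    """Illustrative sketch of region-based annotation allocation.

    `scores` stands in for the paper's per-video learnability estimate L̂_n.
    Videos are ranked by score (descending, an assumption), split into three
    regions -- the top `first` videos, the bottom `last` videos, and the
    middle -- and each video is assigned per_region[r] annotations (the
    paper's A_{t,r} = 2, 1, 0 for r = 1, 2, 3).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    alloc = {}
    for rank, idx in enumerate(order):
        if rank < first:
            r = 0          # first region: most annotations
        elif rank >= n - last:
            r = 2          # last region: no new annotations
        else:
            r = 1          # middle region
        alloc[idx] = per_region[r]
    return alloc
```

For example, with 10 videos and regions of size 3, the three highest-scored videos each receive 2 annotations, the middle four receive 1, and the three lowest receive 0, for a total annotation budget of 10.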