Learnability Matters: Active Learning for Video Captioning

Authors: Yiqian Zhang, Buyu Liu, Jun Bao, Qiang Huang, Min Zhang, Jun Yu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Results on publicly available video captioning datasets with diverse video captioning models demonstrate that our algorithm outperforms SOTA active learning methods by a large margin, e.g., we achieve about 103% of full performance on CIDEr with 25% of human annotations on MSR-VTT."
Researcher Affiliation | Collaboration | 1 Hangzhou Dianzi University, 2 Harbin Institute of Technology (Shenzhen), 3 National University of Singapore; yiqian.zyq@gmail.com, {buyu.liu, baojun}@hit.edu.cn, huangq@comp.nus.edu.sg, zhangmin2021@hit.edu.cn, zju.yujun@gmail.com
Pseudocode | Yes | "Our full caption-wise algorithm is summarized in Alg. 1."
Open Source Code | Yes | "Our code and model will be made available."
Open Datasets | Yes | "We conduct our experiments on two datasets, MSVD (Chen and Dolan, 2011a) and MSR-VTT (Xu et al., 2016)."
Dataset Splits | Yes | "For each dataset, we follow their standard splits and report our active learning performance on their test sets. To mimic the learning process, we initialize L with 5% of data randomly selected from the training set, including both videos and their annotations."
Hardware Specification | Yes | "All experiments are conducted on 4 RTX 3090 GPUs and 4 RTX 4090 GPUs."
Software Dependencies | Yes | "With PyTorch (Paszke et al., 2019) and Huggingface transformers (Wolf et al., 2020)."
Experiment Setup | Yes | "The hyper-parameters λ1, λ2, λ3 in Eq. 5 are 3, 1, and 2, respectively, chosen via grid search on the validation set. Our re-ranking factor q in Eq. 6 is set to 10 on MSR-VTT. Meanwhile, R is set to 3, dividing the ranked videos into 3 regions according to L̂_n. Specifically, both the first and last regions consist of 2000 videos. A_{t,r} equals 2, 1, and 0 with r = {1, 2, 3}."
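The region-based allocation described in the experiment setup (rank videos by L̂_n, split into R = 3 regions with the first and last holding 2000 videos each, and assign A_{t,r} = 2, 1, 0 annotations per region) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, its arguments, and the assumption that higher-scored videos receive more annotations are all our own.

```python
def allocate_annotations(scores, first=2000, last=2000, per_region=(2, 1, 0)):
    """Illustrative sketch of region-based annotation allocation.

    `scores` stands in for the paper's per-video learnability estimate L̂_n.
    Videos are ranked by score (descending, an assumption), split into three
    regions -- the top `first` videos, the bottom `last` videos, and the
    middle -- and each video is assigned per_region[r] annotations (the
    paper's A_{t,r} = 2, 1, 0 for r = 1, 2, 3).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    alloc = {}
    for rank, idx in enumerate(order):
        if rank < first:
            r = 0          # first region: most annotations
        elif rank >= n - last:
            r = 2          # last region: no new annotations
        else:
            r = 1          # middle region
        alloc[idx] = per_region[r]
    return alloc
```

For example, with 10 videos and regions of size 3, the three highest-scored videos each receive 2 annotations, the middle four receive 1, and the three lowest receive 0, for a total annotation budget of 10.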