Learnability Matters: Active Learning for Video Captioning
Authors: Yiqian Zhang, Buyu Liu, Jun Bao, Qiang Huang, Min Zhang, Jun Yu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results on publicly available video captioning datasets with diverse video captioning models demonstrate that our algorithm outperforms SOTA active learning methods by a large margin, e.g., we achieve about 103% of full performance on CIDEr with 25% of human annotations on MSR-VTT. |
| Researcher Affiliation | Collaboration | 1Hangzhou Dianzi University 2Harbin Institute of Technology (Shenzhen) 3National University of Singapore yiqian.zyq@gmail.com, {buyu.liu, baojun}@hit.edu.cn, huangq@comp.nus.edu.sg, zhangmin2021@hit.edu.cn, zju.yujun@gmail.com |
| Pseudocode | Yes | Our full caption-wise algorithm is summarized in Alg. 1. |
| Open Source Code | Yes | Our code and model will be made available. |
| Open Datasets | Yes | We conduct our experiments on two datasets, MSVD Chen and Dolan (2011a) and MSR-VTT Xu et al. (2016). |
| Dataset Splits | Yes | For each dataset, we follow their standard splits and report our active learning performance on their test sets. To mimic the learning process, we initialize L with 5% of data randomly selected from the training set, including both videos and their annotations. |
| Hardware Specification | Yes | All experiments are conducted on 4 RTX 3090 GPUs and 4 RTX 4090 GPUs with PyTorch Paszke et al. (2019), Huggingface transformers Wolf et al. (2020). |
| Software Dependencies | Yes | All experiments are conducted on 4 RTX 3090 GPUs and 4 RTX 4090 GPUs with PyTorch Paszke et al. (2019), Huggingface transformers Wolf et al. (2020). |
| Experiment Setup | Yes | The hyper-parameters λ1, λ2, λ3 in Eq. 5 are 3, 1, and 2, respectively; they are chosen via grid search on the validation set. Our re-ranking factor q in Eq. 6 is set to 10 on MSR-VTT. Meanwhile, R is set to 3, dividing the ranked videos into 3 regions according to L̂n. Specifically, both the first and last regions consist of 2000 videos. At,r equals 2, 1, and 0 for r = 1, 2, 3, respectively. |
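The region-based allocation in the setup row can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `allocate_annotations`, the argument names, and the use of a generic `scores` list standing in for the paper's L̂n ranking are all assumptions; only the region sizes (first and last regions of 2000 videos) and the per-region annotation counts At,r = (2, 1, 0) come from the quoted setup.

```python
def allocate_annotations(scores, first_region=2000, last_region=2000,
                         per_region=(2, 1, 0)):
    """Assign annotation counts per video under a 3-region ranking scheme.

    scores: one score per video (a stand-in for the paper's L-hat_n);
            higher scores are ranked earlier.
    Returns a list of annotation counts, one per video index.
    """
    n = len(scores)
    # Rank video indices by descending score.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    counts = [0] * n
    for pos, idx in enumerate(ranked):
        if pos < first_region:          # region r = 1: top-ranked videos
            r = 0
        elif pos >= n - last_region:    # region r = 3: bottom-ranked videos
            r = 2
        else:                           # region r = 2: everything in between
            r = 1
        counts[idx] = per_region[r]
    return counts
```

With small region sizes for illustration, `allocate_annotations(list(range(10)), first_region=3, last_region=3)` gives the top three scoring videos 2 annotations each, the bottom three 0, and the middle four 1.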