Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Authors: Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher Manning
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AURORACAP shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). We present results on several widely used benchmarks, but find that existing video understanding benchmarks are either question-answer-based (Song et al., 2023; Chen & Dolan, 2011; Caba Heilbron et al., 2015; Xu et al., 2016; Xiao et al., 2021; Fu et al., 2024; Wu et al., 2024), which cannot demonstrate detailed descriptive abilities, or they provide descriptions that are too short, with only a few words (Xu et al., 2016; Caba Heilbron et al., 2015) as shown in Table 1. We also conduct extensive ablation studies based on AURORACAP. |
| Researcher Affiliation | Collaboration | Wenhao Chai 1,2; Enxin Song; Yilun Du 4; Chenlin Meng 2,3; Vashisht Madhavan 2; Omer Bar-Tal 2; Jenq-Neng Hwang 1; Saining Xie 5; Christopher D. Manning 3 (1 University of Washington, 2 Pika Labs, 3 Stanford, 4 Harvard, 5 New York University) |
| Pseudocode | Yes | Token Merging is applied between the attention and MLP within each transformer block as follows (a minimal code sketch is given after this table): 1. Alternately partition the tokens into two sets A and B of roughly equal size. 2. For each token in set A, compute its similarity with each token in set B using the cosine similarity of the Key features from the attention block. 3. Use bipartite soft matching and select the r most similar pairs. 4. Merge each selected pair using a weighted average and record the resulting token size. 5. Concatenate the two sets A and B back together. |
| Open Source Code | No | The code for model training, evaluation, and deployment, as well as the model weights, benchmark, and training datasets for all training stages, will be released with the paper. |
| Open Datasets | Yes | Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. We utilize GPT-4o as our recaption assistant with a hierarchical prompt design. Panda-70M (Chen et al., 2024e) offers a high-resolution, open-domain YouTube video dataset with diverse one-minute clips across wildlife, cooking, sports, news, TV shows, gaming, and 3D rendering, ideal for studying complex real-world scenarios. The training data used in each stage are shown in Table G6, Table G7, and Table G8. |
| Dataset Splits | No | The paper lists datasets used for training (Tables G6, G7, G8) and evaluation (Appendix H) with the number of samples in each, but does not explicitly specify the training/test/validation splits for these datasets when used for AURORACAP's training. For example, it lists '1.3M LAION-CC-SBU-595K' for pretraining but not how this dataset is split into train/val/test for their model's training process. |
| Hardware Specification | Yes | For each input, we process the video at a resolution of 378 × 378 and sample 8 frames using a single H100 GPU. Training costs are reported in H100 hours. (A minimal preprocessing sketch follows the table.) |
| Software Dependencies | No | We are also thankful to Xtuner, lmms-eval, and SGLang for their well-developed and user-friendly codebases. |
| Experiment Setup | Yes | Appendix G (Detailed Training Settings): Training hyper-parameters for both stages are shown in Table G9. |
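
The token-merging procedure quoted in the Pseudocode row can be made concrete with a short PyTorch sketch. This is a minimal illustration of the five steps (alternate partition, Key-based cosine similarity, bipartite soft matching of the top r pairs, size-weighted averaging, re-concatenation); the function name `bipartite_soft_merge`, its signature, and the tensor layout are assumptions for illustration, not AuroraCap's released code.

```python
import torch
import torch.nn.functional as F


def bipartite_soft_merge(x: torch.Tensor, keys: torch.Tensor,
                         size: torch.Tensor, r: int):
    """Merge the r most similar token pairs between two alternating sets.

    x:    (N, C) token features passed on to the MLP (illustrative layout)
    keys: (N, C) Key features from the preceding attention block
    size: (N,)   number of original tokens each current token represents
    r:    number of token pairs to merge away
    """
    # Step 1: alternately partition tokens into set A (even) and set B (odd).
    a_idx = torch.arange(0, x.size(0), 2)
    b_idx = torch.arange(1, x.size(0), 2)

    # Step 2: cosine similarity between A and B, measured on the Key features.
    ka = F.normalize(keys[a_idx], dim=-1)
    kb = F.normalize(keys[b_idx], dim=-1)
    scores = ka @ kb.T                        # (|A|, |B|)

    # Step 3: bipartite soft matching; keep the r highest-scoring pairs.
    best_sim, best_b = scores.max(dim=-1)     # best partner in B for each A token
    order = best_sim.argsort(descending=True)
    src = order[:r]                           # A tokens merged away
    keep = order[r:]                          # A tokens kept
    dst = best_b[src]                         # destination tokens in B

    # Step 4: size-weighted average of each merged pair; update token sizes.
    xa, xb = x[a_idx], x[b_idx]
    sa = size[a_idx].unsqueeze(-1).to(x.dtype)
    sb = size[b_idx].unsqueeze(-1).to(x.dtype)
    xb = xb * sb
    xb = xb.index_add(0, dst, xa[src] * sa[src])
    sb = sb.index_add(0, dst, sa[src])
    xb = xb / sb

    # Step 5: concatenate the remaining A tokens with the updated B tokens.
    merged_x = torch.cat([xa[keep], xb], dim=0)
    merged_size = torch.cat([sa[keep], sb], dim=0).squeeze(-1)
    return merged_x, merged_size
```

In ToMe-style usage, a call like this would sit between the attention and MLP sublayers of each visual transformer block, reducing the token count by r per block while the recorded token sizes keep subsequent attention properly weighted.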
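
The input processing quoted in the Hardware Specification row (8 uniformly sampled frames at 378 × 378) can likewise be sketched. The use of decord and torchvision, the helper name `sample_frames`, and the uniform-sampling strategy are assumptions for illustration; the paper's table does not specify AuroraCap's actual preprocessing code.

```python
import numpy as np
import torch
from decord import VideoReader
from torchvision.transforms.functional import resize


def sample_frames(video_path: str, num_frames: int = 8, size: int = 378) -> torch.Tensor:
    """Uniformly sample `num_frames` frames and resize them to `size` x `size`."""
    vr = VideoReader(video_path)
    # Pick frame indices spread evenly across the whole clip.
    idx = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    frames = torch.from_numpy(vr.get_batch(idx).asnumpy())   # (T, H, W, C) uint8
    frames = frames.permute(0, 3, 1, 2).float() / 255.0       # (T, C, H, W) in [0, 1]
    return resize(frames, [size, size], antialias=True)
```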