Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Authors: Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher Manning

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type Experimental AURORACAP shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). We present results on several widely used benchmarks, but find that existing video understanding benchmarks are either question-answer-based (Song et al., 2023; Chen & Dolan, 2011; Caba Heilbron et al., 2015; Xu et al., 2016; Xiao et al., 2021; Fu et al., 2024; Wu et al., 2024), which cannot demonstrate detailed descriptive abilities, or they provide descriptions that are too short, with only a few words (Xu et al., 2016; Caba Heilbron et al., 2015) as shown in Table 1. We also conduct extensive ablation studies based on AURORACAP.
Researcher Affiliation Collaboration Wenhao Chai 1,2, Enxin Song, Yilun Du 4, Chenlin Meng 2,3, Vashisht Madhavan 2, Omer Bar-Tal 2, Jenq-Neng Hwang 1, Saining Xie 5, Christopher D. Manning 3 (1 University of Washington, 2 Pika Labs, 3 Stanford, 4 Harvard, 5 New York University)
Pseudocode Yes Token Merging is applied between the attention and MLP within each transformer block as: 1. Alternately partition the tokens into two sets A and B of roughly equal size. 2. For each token in set A, calculate the token similarity with each token in set B based on cosine similarity of the Key features in the attention block. 3. Use bipartite soft matching and then select the most similar r pairs. 4. Merge the tokens using a weighted average, recording the token size. 5. Concatenate the two sets A and B back together again.
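The five steps quoted above can be sketched as a single bipartite soft-matching merge step. This is an illustrative NumPy sketch, not the paper's implementation: the function name `token_merge` and its argument layout are assumptions, and the real method operates on attention Key features inside each transformer block rather than on standalone arrays.

```python
import numpy as np

def token_merge(keys, tokens, sizes, r):
    """One bipartite soft-matching merge step (illustrative sketch).

    keys:   (N, d) Key features used for similarity (assumption: precomputed)
    tokens: (N, d) token features to merge
    sizes:  (N,)   how many original tokens each entry already represents
    r:      number of token pairs to merge
    """
    # 1. Alternately partition tokens into sets A (even indices) and B (odd).
    a_idx = np.arange(0, len(tokens), 2)
    b_idx = np.arange(1, len(tokens), 2)

    # 2. Cosine similarity between every A key and every B key.
    k = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
    sim = k[a_idx] @ k[b_idx].T                      # (|A|, |B|)

    # 3. Each A token proposes its most similar B partner; keep the r best edges.
    best_b = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    merged_a = np.argsort(-best_score)[:r]           # A tokens that get merged
    kept_a = np.setdiff1d(np.arange(len(a_idx)), merged_a)

    # 4. Weighted-average merge into the matched B partner, tracking token size.
    tok, sz = tokens.copy(), sizes.astype(float).copy()
    for ia in merged_a:
        src, dst = a_idx[ia], b_idx[best_b[ia]]
        total = sz[src] + sz[dst]
        tok[dst] = (tok[src] * sz[src] + tok[dst] * sz[dst]) / total
        sz[dst] = total

    # 5. Concatenate the surviving A tokens with the (updated) B tokens.
    keep = np.sort(np.concatenate([a_idx[kept_a], b_idx]))
    return tok[keep], sz[keep]
```

Merging r pairs reduces N tokens to N - r per step; the recorded sizes let later weighted averages treat already-merged tokens proportionally.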
Open Source Code No The code for model training, evaluation, and deployment, as well as the model weights, benchmark, and training datasets of all the training stages, will be released with the paper.
Open Datasets Yes Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. We utilize GPT-4o as our recaption assistant with a hierarchical prompt design. Panda-70M (Chen et al., 2024e) offers a high-resolution, open-domain YouTube video dataset with diverse one-minute clips across wildlife, cooking, sports, news, TV shows, gaming, and 3D rendering, ideal for studying complex real-world scenarios. The training data used in each stage are shown in Table G6, Table G7 and Table G8.
Dataset Splits No The paper lists datasets used for training (Tables G6, G7, G8) and evaluation (Appendix H) with the number of samples in each, but does not explicitly specify the training/test/validation splits for these datasets when used for AURORACAP's training. For example, it lists '1.3M LAION-CC-SBU-595K' for pretraining but not how this dataset is split into train/val/test for their model's training process.
Hardware Specification Yes For each input, we process the video at a resolution of 378 × 378 and sample 8 frames using a single H100 GPU. Training costs are reported in H100 hours.
Software Dependencies No We are also thankful to Xtuner, lmms-eval, and SGLang for their well-developed and user-friendly codebases.
Experiment Setup Yes Appendix G (Detailed Training Settings): Training hyper-parameters for both stages are shown in Table G9.