Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. |
| Researcher Affiliation | Academia | 1Tongji University, 2Shanghai Artificial Intelligence Laboratory, 3The Chinese University of Hong Kong, 4S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper describes the GRPO method using mathematical equations (1), (2), and an objective function, but it does not present these in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation. |
| Open Datasets | Yes | We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation. |
| Dataset Splits | Yes | We use around 60k samples for SFT and approximately 8k samples for GRPO. ... For fast exploration, we sample approximately 4k images for the SFT stage and around 1k for GRPO. |
| Hardware Specification | Yes | The training process of SFT is performed on 4 Nvidia A100 GPUs and the GRPO process is performed on 8 Nvidia A100 GPUs. |
| Software Dependencies | Yes | Our implementation is based on ms-swift [64]. We use FlashAttention-2 [10] as the model's attention implementation and bfloat16 precision for both training and inference to reduce memory consumption. |
| Experiment Setup | Yes | In SFT stage, the global batch size is set to 4, and the model is trained for 1 epoch with a learning rate of 1e-5. In GRPO stage, we set the group size G to 12 and the global batch size to 24. The clipping parameter ϵ is set to 0.2. The model is trained for 10 epochs with a learning rate of 1 × 10⁻⁶. Detailed hyper-parameters are provided in the Appendix B. |
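
The Pseudocode row notes that the paper gives GRPO only as equations, so the following is a minimal Python sketch of how the two reported GRPO settings (group size G = 12, clipping parameter ϵ = 0.2) typically enter the computation. It assumes the commonly used GRPO formulation (rewards normalized within each group, then a PPO-style clipped surrogate); it is not taken from the paper's code or from ms-swift. The function names, the binary reward values, and the use of a population standard deviation are all illustrative assumptions, and the KL penalty term is omitted.

```python
from statistics import mean, pstdev

# Values reported in the Experiment Setup row above.
GROUP_SIZE = 12   # G: responses sampled per prompt
CLIP_EPS = 0.2    # ϵ: clipping parameter

def group_relative_advantages(rewards, tiny=1e-8):
    """Normalize each reward against its own group's mean and std.

    Assumption: population std with a small constant for stability;
    an actual implementation may use a different estimator.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + tiny) for r in rewards]

def clipped_term(ratio, advantage, eps=CLIP_EPS):
    """PPO-style clipped surrogate for one sample (KL penalty omitted)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical rewards for one group: 4 correct answers, 8 incorrect.
rewards = [1.0] * 4 + [0.0] * (GROUP_SIZE - 4)
advantages = group_relative_advantages(rewards)

# Policy ratios of 1.0 stand in for a freshly sampled group.
objective = mean(clipped_term(1.0, a) for a in advantages)
```

Because advantages are centered within the group, correct responses receive positive values and incorrect ones negative values, and a group in which every response earns the same reward contributes no learning signal under this formulation.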