Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

Authors: Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analysis, both coarse and fine-grained reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls.
Researcher Affiliation | Collaboration | Agneet Chatterjee (1, 2), Rahim Entezari (1), Maksym Zhuravinskyi (1), Max Lapin (1), Reshinth Adithyan (1), Amit Raj (3), Chitta Baral (2), Yezhou Yang (2), Varun Jampani (1). Affiliations: (1) Stability AI, (2) Arizona State University, (3) Google DeepMind.
Pseudocode | No | The paper describes methodologies and taxonomies but does not present structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states in the NeurIPS Paper Checklist, section 5, that 'We will release our proposed taxonomy upon acceptance.' This refers to the taxonomy itself, not explicitly to open-source code for the experimental methodology, such as the trained VLM or the prompt generation pipeline described.
Open Datasets | No | The paper states, 'Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases...' and 'The SCINE benchmark comprises two prompt categories, Scripts and Visuals, each aligned with distinct professional roles (Table 2). The Visuals prompts are created by systematically upsampling Scripts using our taxonomies leading to a total of 2,089 prompts.' However, there is no explicit statement or link in the paper indicating public availability of this benchmark dataset (prompts).
Dataset Splits | Yes | We adopt Qwen-2.5-VL-7B [53] as the base model for finetuning; our training and validation dataset consist of 44,062 and 12,763 samples, respectively.
Hardware Specification | No | The paper's NeurIPS checklist states that compute resources are reported in the Appendix, but specific hardware details such as GPU/CPU models or memory specifications used for running the experiments are not explicitly provided in the main text or the appendices.
Software Dependencies | Yes | We adopt Qwen-2.5-VL-7B [53] as the base model for finetuning; ... We use Qwen2.5-VL-Instruct models due to their strong video understanding capabilities. To study the effect of model scale, we evaluate 3 sizes: 7, 32 and 72B.
Experiment Setup | Yes | The model is trained for 1 epoch with a batch size of 8 and learning rate 6e-5.
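
The figures quoted in the Dataset Splits and Experiment Setup rows can be gathered into a small configuration sketch to see what they imply. The class and function names below are hypothetical and do not come from the paper. The step count assumes that the batch size of 8 is the effective global batch, with no gradient accumulation, and that the final partial batch is kept; the paper excerpts above do not state either point.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the reproducibility table; field names are illustrative."""
    base_model: str = "Qwen-2.5-VL-7B"
    train_samples: int = 44_062
    val_samples: int = 12_763
    epochs: int = 1
    batch_size: int = 8
    learning_rate: float = 6e-5


def steps_per_epoch(setup: ReportedSetup) -> int:
    # Assumption: batch_size is the effective global batch size and the
    # last, smaller batch is not dropped.
    return math.ceil(setup.train_samples / setup.batch_size)


def split_fractions(setup: ReportedSetup) -> tuple[float, float]:
    # Share of all labeled samples used for training and for validation.
    total = setup.train_samples + setup.val_samples
    return setup.train_samples / total, setup.val_samples / total


setup = ReportedSetup()
total_steps = steps_per_epoch(setup) * setup.epochs
train_frac, val_frac = split_fractions(setup)

print(f"optimizer steps (under the stated assumptions): {total_steps}")
print(f"train/validation split: {train_frac:.1%} / {val_frac:.1%}")
```

Under these assumptions the single epoch corresponds to 5,508 optimizer steps, and the reported sample counts amount to roughly a 78/22 train/validation split.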