Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MJ-Video: Benchmarking and Rewarding Video Generation with Fine-Grained Video Preference

Authors: Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, Mingyu Ding, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively.
Researcher Affiliation Academia 1UNC-Chapel Hill 2UIUC 3UChicago 4University of Oxford 5Stanford University
Pseudocode No The paper describes the methodology in prose and through architectural diagrams (Figure 3), but does not include a distinct pseudocode or algorithm block.
Open Source Code Yes Code and data can be found in supplemental materials. video demos to demonstrate (1) MJ-BENCH-VIDEO contains high quality video preference pairs at anonymous https://anonymous.4open.science/r/mj-video-neurips-364C/, and (2) MJ-VIDEO is able to improve video generation quality of VideoCrafter-V2 [6] at anonymous https://anonymous.4open.science/r/mj-video-neurips-364C/video_demo.zip.
Open Datasets Yes We introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. MJ-BENCH-VIDEO contains high quality video preference pairs at anonymous https://anonymous.4open.science/r/mj-video-neurips-364C/.
Dataset Splits Yes Dataset Split. We divide MJ-BENCH-VIDEO into a training set and a test set at a 4:1 ratio, leading to 4,336 training video pairs and 1,085 testing video pairs.
Hardware Specification No The paper discusses various models, training parameters, and evaluation metrics, but does not explicitly specify the hardware used for running the experiments (e.g., specific GPU or CPU models, or cloud computing instance types).
Software Dependencies No The paper mentions various models and optimizers (e.g., AdamW), but does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the implementation.
Experiment Setup Yes Specifically, the first stage is to train the Criteria MoE layer to predict the annotated fine-grained criteria scores. The second stage is to leverage aspect ranking information from preference pairs to train the Aspect MoE layer. In the final stage, we integrate the previous training steps and introduce an overall preference ranking loss to jointly optimize both the aspect MoE layer and the criteria MoE layer. We detail the three-stage training as follows: Stage I: Criteria Scoring Training. We use the fine-grained annotated criteria scores s ∈ R^28 as labels to train the Criteria MoE layer, ensuring accurate judgment... The training follows a batch size of 64, a warmup step of 25, and a learning rate of 3e-5, with a cosine decay learning rate scheduler. We use AdamW as the optimizer and train on the criteria-level annotations from MJ-BENCH-VIDEO. The model is trained for 3 epochs, totaling 201 steps.
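
The numeric settings quoted in the Dataset Splits and Experiment Setup rows can be cross-checked with a short sketch. The Python below is illustrative only and is not the authors' code. It assumes that the 4,336 training pairs are the examples batched in Stage I, that the last incomplete batch of each epoch is dropped, that warmup is linear, and that the cosine decay ends at zero. None of these details is stated in the source; the helper names `steps_per_epoch` and `lr_at` are hypothetical.

```python
import math

# Values quoted in the table above.
TRAIN_PAIRS = 4336
TEST_PAIRS = 1085
BATCH_SIZE = 64
WARMUP_STEPS = 25
PEAK_LR = 3e-5
EPOCHS = 3
TOTAL_STEPS = 201


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, ASSUMING the last incomplete batch is dropped."""
    return num_examples // batch_size


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    ASSUMPTION: linear warmup to PEAK_LR, then cosine decay toward zero.
    The source states only a warmup of 25 steps and a cosine decay scheduler.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"train/test ratio: {TRAIN_PAIRS / TEST_PAIRS:.3f}")
    total = EPOCHS * steps_per_epoch(TRAIN_PAIRS, BATCH_SIZE)
    print(f"steps over {EPOCHS} epochs: {total}")
    for step in (0, 24, 25, 113, 200):
        print(step, lr_at(step))
```

Under these assumptions, 4,336 examples at batch size 64 give 67 full batches per epoch, and 3 epochs give 201 steps, which matches the reported total. This agreement is an observation about the arithmetic, not a confirmation of how the authors batched their data.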