Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

Authors: Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser Nam Lim

NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 28,561 (total tokens sent to the LLM as input for this paper's analysis) | Output Tokens: 4,020 (total tokens produced by the LLM, including reasoning/thinking tokens, for this paper's analysis)

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models.
Researcher Affiliation | Collaboration | Harold Haodong Chen (1,2,3), Haojian Huang (1,2), Qifeng Chen (2,3), Harry Yang (2,3), Ser-Nam Lim (3,4). (1) The Hong Kong University of Science and Technology (Guangzhou), (2) The Hong Kong University of Science and Technology, (3) Everlyn AI, (4) University of Central Florida
Pseudocode | No | The paper describes methodologies using mathematical equations (e.g., Eq. (1) to (9)) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Answer: [No] Justification: The complete code, based on the primary model CogVideoX, is provided anonymously. The data used is from the open-source dataset OpenVid.
Open Datasets | Yes | Given a large-scale high-quality text-video dataset D (we adopt OpenVidHD-0.4M [57] in this work, which is widely utilized for post-training [12, 87, 68, 13])
Dataset Splits | No | The paper mentions training on 'our selected dataset' and evaluates on 'physics-focused (i.e., VideoPhy [6], PhyGenBench [54]) and general capability (i.e., VBench [34]) benchmarks', but does not explicitly provide training/test/validation splits for its own selected dataset (21K samples).
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA H100 GPUs.
Software Dependencies | No | The paper mentions using specific models like Qwen2.5-VL [4], DeepSeek-VL2 [79], Qwen2.5 [86], and LLaMA-1 13B [71], and the AdamW optimizer, but does not specify version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We train base models on our selected dataset with a global batch size of 8, using the AdamW optimizer and a learning rate of 2e-5. Instance-level non-preferred weights are set to β_err = 0.7 and β_gap = 0.3, with N = 2 for state-level samples. The loss weights λ, ρ, and µ are set to 0.4, 0.3, and 0.2, respectively.
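
For anyone attempting a reproduction, the values quoted in the Experiment Setup and Hardware Specification rows can be gathered into a single configuration object. The following is only a minimal sketch: the class name, the field names, and the per-GPU batch size helper are hypothetical and do not come from the paper or its code. In particular, the helper assumes plain data parallelism across all 8 GPUs with no gradient accumulation, which the report does not state.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the report; all field names are hypothetical."""

    global_batch_size: int = 8     # "global batch size of 8"
    optimizer: str = "AdamW"       # optimizer named in the report
    learning_rate: float = 2e-5    # "learning rate of 2e-5"
    beta_err: float = 0.7          # instance-level non-preferred weight
    beta_gap: float = 0.3          # instance-level non-preferred weight
    n_state_samples: int = 2       # N = 2 for state-level samples
    lambda_weight: float = 0.4     # loss weight lambda
    rho_weight: float = 0.3        # loss weight rho
    mu_weight: float = 0.2         # loss weight mu
    num_gpus: int = 8              # 8 NVIDIA H100 GPUs

    def per_gpu_batch_size(self) -> int:
        # Assumption (not stated in the report): pure data parallelism
        # over all GPUs and no gradient accumulation.
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global batch size must divide evenly across GPUs")
        return self.global_batch_size // self.num_gpus


setup = ReportedSetup()

if __name__ == "__main__":
    # Print every reported value, then the derived per-GPU batch size.
    for name, value in asdict(setup).items():
        print(f"{name}: {value}")
    print("per-GPU batch size (assumed):", setup.per_gpu_batch_size())
```

Keeping the reported numbers in one frozen object makes it easy to log them alongside a run. How the loss weights enter the training objective is defined by the paper's equations and is deliberately not reproduced here.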