Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

Authors: Harold Haodong Chen, Haojian Huang, Qifeng Chen, Harry Yang, Ser-Nam Lim

NeurIPS 2025 | Venue PDF | LLM Run Details | Input Tokens: 28,561 (total number of tokens sent to the LLM as input for this paper's analysis) | Output Tokens: 4,020 (total number of tokens produced by the LLM, including reasoning/thinking tokens, for this paper's analysis)

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models.
Researcher Affiliation	Collaboration	Harold Haodong Chen (1,2,3), Haojian Huang (1,2), Qifeng Chen (2,3), Harry Yang (2,3), Ser-Nam Lim (3,4). Affiliations: (1) The Hong Kong University of Science and Technology (Guangzhou); (2) The Hong Kong University of Science and Technology; (3) Everlyn AI; (4) University of Central Florida.
Pseudocode	No	The paper describes methodologies using mathematical equations (e.g., Eqs. (1) to (9)) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Answer: [No] Justification: The complete code, based on the primary model CogVideoX, is provided anonymously. The data used is from the open-source dataset OpenVid.
Open Datasets	Yes	Given a large-scale high-quality text-video dataset D (we adopt OpenVidHD-0.4M [57] in this work, which is widely utilized for post-training [12, 87, 68, 13])
Dataset Splits	No	The paper mentions training on 'our selected dataset' and evaluates on 'physics-focused (i.e., VideoPhy [6], PhyGenBench [54]) and general capability (i.e., VBench [34]) benchmarks', but does not explicitly provide training/test/validation splits for its own selected dataset (21K samples).
Hardware Specification	Yes	All experiments are conducted on 8 NVIDIA H100 GPUs.
Software Dependencies	No	The paper mentions using specific models like Qwen2.5-VL [4], DeepSeek-VL2 [79], Qwen2.5 [86], and LLaMA-1 13B [71], and the AdamW optimizer, but does not specify version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup	Yes	We train base models on our selected dataset with a global batch size of 8, using the AdamW optimizer and a learning rate of 2e-5. Instance-level non-preferred weights are set to β_err = 0.7 and β_gap = 0.3, with N = 2 for state-level samples. The loss weights λ, ρ, and µ are set to 0.4, 0.3, and 0.2, respectively.
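
For readers who want to carry the reported setup into their own scripts, the values quoted in the Experiment Setup and Hardware Specification rows can be collected in one place. This is only a minimal sketch, not the authors' configuration format: the class name `ReportedSetup`, all field names, and the `per_gpu_batch_size` helper are hypothetical. The learning rate is read as 2e-5, and the per-GPU batch size assumes plain data parallelism across the 8 GPUs, neither of which the record confirms.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the record; field names are hypothetical."""

    num_gpus: int = 8             # "8 NVIDIA H100 GPUs"
    global_batch_size: int = 8    # "global batch size of 8"
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5   # assumption: the extracted text lost the minus sign
    beta_err: float = 0.7         # instance-level non-preferred weight
    beta_gap: float = 0.3         # instance-level non-preferred weight
    n_state_samples: int = 2      # "N = 2 for state-level samples"
    loss_lambda: float = 0.4      # loss weight lambda
    loss_rho: float = 0.3         # loss weight rho
    loss_mu: float = 0.2          # loss weight mu

    def per_gpu_batch_size(self, grad_accum_steps: int = 1) -> int:
        """Per-device batch size under data parallelism (an assumption)."""
        samples_per_unit = self.num_gpus * grad_accum_steps
        if self.global_batch_size % samples_per_unit != 0:
            raise ValueError("global batch size is not divisible by GPUs x accumulation steps")
        return self.global_batch_size // samples_per_unit


setup = ReportedSetup()
for name, value in asdict(setup).items():
    print(f"{name}: {value}")
print("per-GPU batch size:", setup.per_gpu_batch_size())
```

The record does not say which loss term each of λ, ρ, and µ scales, so the sketch keeps them as plain named weights rather than guessing at their roles.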