Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
Authors: Yunuo Chen, Junli Cao, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Jian Ren, Sergey Tulyakov, Anil Kag
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the effectiveness of our proposed method through both qualitative and quantitative evaluations. Our fine-tuning approach can be seamlessly integrated into different base video diffusion models. In our experiment, we adopt two baseline image-to-video models with distinct architectures: a UNet-based model, I2VGen-XL [46], and a DiT-based model, CogVideoX 1.5 [43]. We refer readers to our supplementary material for more details on the training setup. |
| Researcher Affiliation | Collaboration | Yunuo Chen1,2, Junli Cao1,2, Vidit Goel2, Sergei Korolev2, Chenfanfu Jiang1, Jian Ren2, Sergey Tulyakov2, Anil Kag2; 1University of California, Los Angeles, 2Snap Inc. |
| Pseudocode | Yes | Algorithm 1: Post-processing for Point Tracking. Require: Tracked points $P \in \mathbb{R}^{T \times N \times 3}$, Foreground mask $M \in \{0, 1\}^{H \times W}$, Resolution $(H, W)$. Ensure: Processed tensor $P \in \mathbb{R}^{T \times H \times W \times 3}$ |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The authors will release the dataset and generation pipeline upon acceptance. |
| Open Datasets | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The authors will release the dataset and generation pipeline upon acceptance. |
| Dataset Splits | Yes | To quantitatively evaluate the performance of our model, we adopt a test batch of randomly selected images. The test images are sampled from a single batch of video clips (unseen by our model) from the video dataset [9], with the initial frame randomly selected within each clip, totaling 372 images. |
| Hardware Specification | Yes | Model Training: We implement our pipeline on two baseline models: UNet-based model I2VGen-XL [46] and DiT-based model CogVideoX 1.5-5B [43]. For our UNet model, we use a resolution of 448 × 256 with 16 frames; for our DiT model, we use a resolution of 1360 × 768 with 48 frames. During finetuning, we train on 8 NVIDIA A100 80G GPUs with a batch size of 4 for the UNet model and a batch size of 1 for the DiT model. |
| Software Dependencies | No | We extract the first frame as a reference image and perform semantic segmentation using Grounded-SAM-2 [33, 32], yielding masks for the foreground objects. To streamline the process, we utilize a language model, LMDeploy [10], to infer the main moving objects in the scene based on the video's caption. For example, the caption "A man swinging a baseball bat in a studio" results in {man, baseball bat}. This object-level label is subsequently fed to Grounded-SAM-2 for prompt-based segmentation. Point Tracking: Having obtained a reference frame and its segmentation, we apply SpaTracker [41] to track pixel movements accordingly. |
| Experiment Setup | Yes | Model Training: We implement our pipeline on two baseline models: UNet-based model I2VGen-XL [46] and DiT-based model CogVideoX 1.5-5B [43]. For our UNet model, we use a resolution of 448 × 256 with 16 frames; for our DiT model, we use a resolution of 1360 × 768 with 48 frames. During finetuning, we train on 8 NVIDIA A100 80G GPUs with a batch size of 4 for the UNet model and a batch size of 1 for the DiT model. Joint Optimization: To effectively inject 3D-awareness into the video model without degrading the RGB modality (see subsection 4.3 for details), we adopt a joint optimization strategy combining diffusion and regularization losses. In our experiments, we observe that when the regularization loss significantly outweighs the diffusion loss (by orders of magnitude), it tends to suppress motion. Conversely, when the diffusion loss dominates training, non-physical artifacts reappear. In practice, we balance the two losses to maintain comparable magnitudes, which consistently reduces reconstruction misalignment and noticeably mitigates non-physical artifacts. To further preserve the motion magnitude, we apply the regularization loss less frequently (once every k iterations; we use k = 5). |
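
The joint optimization schedule quoted in the Experiment Setup row (regularization applied once every k iterations with k = 5, and the two losses kept at comparable magnitudes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of `step % k == 0` as the trigger, and the ratio-based rescaling used to equalize magnitudes are all assumptions made for illustration, since the excerpt does not say how the balance is achieved.

```python
def regularization_due(step: int, k: int = 5) -> bool:
    """Return True on iterations where the regularization loss is applied.

    Assumption: the loss fires when step is a multiple of k (k = 5 in the excerpt).
    """
    return step % k == 0


def balanced_weight(diffusion_loss: float, reg_loss: float, eps: float = 1e-8) -> float:
    """Hypothetical heuristic: scale reg_loss to the magnitude of diffusion_loss."""
    return abs(diffusion_loss) / (abs(reg_loss) + eps)


def joint_loss(step: int, diffusion_loss: float, reg_loss: float, k: int = 5) -> float:
    """Combine the two losses for one training iteration.

    On most steps only the diffusion loss is used; every k-th step the
    rescaled regularization loss is added on top.
    """
    if not regularization_due(step, k):
        return diffusion_loss
    weight = balanced_weight(diffusion_loss, reg_loss)
    return diffusion_loss + weight * reg_loss


if __name__ == "__main__":
    # Toy values: the raw regularization loss is orders of magnitude larger.
    for step in range(1, 11):
        print(step, round(joint_loss(step, diffusion_loss=0.2, reg_loss=50.0), 4))
```

In a real training loop the weight would typically be treated as a constant (detached from the gradient graph) or fixed in advance; the sketch only shows the scheduling and scaling arithmetic on plain floats.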