Compositional 3D-aware Video Generation with LLM Director
Authors: Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, Jiang Bian
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept. |
| Researcher Affiliation | Collaboration | Hanxin Zhu¹, Tianyu He², Anni Tang³, Junliang Guo², Zhibo Chen¹, Jiang Bian² (¹University of Science and Technology of China, ²Microsoft Research Asia, ³Shanghai Jiao Tong University) |
| Pseudocode | No | The paper describes its methods using prose and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://aka.ms/c3v. The NeurIPS Paper Checklist states: "The paper proposes a novel paradigm for text-to-video generation, named C3V. We plan to release the model's code along with comprehensive documentation to facilitate its use and replication." This indicates that code release is planned, not that code is currently available. |
| Open Datasets | Yes | We use Lucid Dreamer [60], Human Gaussian [61] and Motion-X [56] to generate 3D scenes, humanoid objects and motions respectively. Motion-X [56] is a large-scale 3D expressive whole-body human motion dataset which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes with sequence-level semantic labels. |
| Dataset Splits | No | The paper mentions 'training' but does not explicitly provide details about a 'validation' dataset split or its purpose within the experimental setup. |
| Hardware Specification | Yes | All experiments are conducted using a single NVIDIA A100 GPU. |
| Software Dependencies | No | To realize SDS, we utilize Stable Diffusion [7] as the image diffusion model. No version numbers are provided for software dependencies. (A sketch of the SDS objective follows the table.) |
| Experiment Setup | Yes | During multi-modal LLM-based trajectory estimation, 20 locations are used by default to indicate the trajectory between the starting point and the ending point (i.e., N = 20 in Eq. 4), with the path between adjacent locations assumed to be a straight line (see the trajectory sketch after the table). For scale refinement (Eq. 6), τs is set to 0.1. Location refinement (Eq. 7) is applied to all twenty locations, with 1000 training iterations per location and τL set to 0.1. |
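
The paper says only that SDS is realized with Stable Diffusion, without naming versions. For readers checking reproducibility, below is a minimal PyTorch-style sketch of one Score Distillation Sampling step under stated assumptions: `diffusion_model`, `noise_scheduler`, and the guidance scale are hypothetical stand-ins for a Stable Diffusion UNet and its scheduler, not the authors' actual implementation.

```python
import torch

def sds_loss(rendered_image, text_embedding, diffusion_model, noise_scheduler,
             guidance_scale=100.0):
    """One Score Distillation Sampling (SDS) step (sketch).

    `diffusion_model` and `noise_scheduler` are hypothetical stand-ins for a
    pretrained image diffusion model and its noise scheduler.
    """
    b = rendered_image.shape[0]
    # Sample a random diffusion timestep and Gaussian noise.
    t = torch.randint(20, 980, (b,), device=rendered_image.device)
    noise = torch.randn_like(rendered_image)
    noisy = noise_scheduler.add_noise(rendered_image, noise, t)
    with torch.no_grad():
        # Predict noise with classifier-free guidance (conditional vs. unconditional).
        eps_cond = diffusion_model(noisy, t, text_embedding)
        eps_uncond = diffusion_model(noisy, t, None)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    w = 1.0  # timestep weighting w(t); constant here for simplicity
    grad = w * (eps - noise)
    # Detach the gradient target so that d(loss)/d(rendered_image) == grad,
    # matching the SDS update rule without backpropagating through the UNet.
    return (grad.detach() * rendered_image).sum()
```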
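
The trajectory setup in the last row (N = 20 locations, with straight-line paths between adjacent locations) can be illustrated with a short sketch. The function below is an assumption about how such a piecewise-linear path might be densified for rendering; `interpolate_trajectory` and its parameters are illustrative names, not from the paper.

```python
import numpy as np

def interpolate_trajectory(waypoints, steps_per_segment=8):
    """Expand N waypoints (the paper's default is N = 20, Eq. 4) into a dense
    path, assuming straight-line motion between adjacent locations.

    `waypoints` is an (N, 3) array of xyz positions; names are illustrative.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    dense = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        # Linearly interpolate along each straight segment; the endpoint is
        # excluded to avoid duplicating the next segment's starting location.
        for s in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            dense.append((1.0 - s) * a + s * b)
    dense.append(waypoints[-1])
    return np.stack(dense)

# Example: a straight path from the origin to (19, 0, 0) sampled at N = 20 waypoints.
waypoints = np.stack([np.array([i, 0.0, 0.0]) for i in range(20)])
path = interpolate_trajectory(waypoints)
```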