Boximator: Generating Rich and Controllable Motions for Video Synthesis

Authors: Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models, with further gains after incorporating box constraints. Its robust motion controllability is validated by drastic increases in the bounding box alignment metric. Human evaluation also shows that users favor Boximator's generations over those of the base model.
Researcher Affiliation | Industry | ByteDance Research, Beijing, China. Correspondence to: Jiawei Wang, Yuchen Zhang <{wangjiawei.424, zhangyuchen.zyc}@bytedance.com>.
Pseudocode | No | The paper describes its architecture and procedures in text and diagrams (e.g., Figure 2) but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'Check our website for more cases: https://boximator.github.io/', but this link points to demonstration cases; the paper does not state that the source code for the described method is available there, and it provides no other concrete statement or link about a code release.
Open Datasets | Yes | We curated our training set from the WebVid-10M dataset (Bain et al., 2021)... We test our models using the MSR-VTT (Xu et al., 2016), ActivityNet (Caba Heilbron et al., 2015) and UCF-101 (Soomro et al., 2012) datasets.
Dataset Splits | No | The paper mentions using a 'portion of the ActivityNet validation set' and describes processing for the evaluation datasets (the MSR-VTT and UCF-101 test sets). However, it does not provide specific training/validation splits (e.g., percentages or exact counts) for its curated WebVid-10M training set, which would be needed to fully reproduce the data partitioning. A sketch of what such a specification could look like follows below.
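For illustration only, a fully specified split could be stated as compactly as the following sketch. The ID format, validation fraction, and seed are hypothetical placeholders, not values from the paper.

```python
# Hypothetical split specification: the paper gives no exact split for its curated
# WebVid-10M subset, so every concrete value below (IDs, fraction, seed) is assumed.
import random

def make_split(video_ids, val_fraction=0.01, seed=0):
    """Deterministically partition curated video IDs into train and validation lists."""
    rng = random.Random(seed)
    ids = sorted(video_ids)            # sort first so the result is order-independent
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]    # (train_ids, val_ids)

# Toy usage with placeholder IDs standing in for the curated WebVid-10M clips.
train_ids, val_ids = make_split([f"vid_{i:07d}" for i in range(1_000)])
```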
Hardware Specification | Yes | The training uses the Adam optimizer, with a batch size of 128 across 16 NVIDIA Tesla A100 GPUs.
Software Dependencies | No | The paper mentions various software components and tools, such as the Adam optimizer, the DDIM inference algorithm, LLaVA, spaCy, Grounding DINO, and the DEVA object tracker, but it does not provide version numbers for these dependencies, which are needed for a reproducible description.
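One way a reimplementation could close this gap is to record the environment explicitly. The sketch below is an assumption about packaging (the PyPI distribution names are not given in the paper), not a description of the authors' setup.

```python
# Record installed versions of pip-distributed dependencies; package names are assumed.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "spacy"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

# LLaVA, Grounding DINO, and DEVA are typically installed from their research repos,
# so pinning a git commit hash for each plays the role of a version number.
```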
Experiment Setup | Yes | Our models train on 16-frame sequences with a resolution of 256×256 pixels, running at 4 frames per second. We limit the maximum number of objects to N = 8. The training uses the Adam optimizer, with a batch size of 128 across 16 NVIDIA Tesla A100 GPUs. As outlined in Section 4.4, training occurs in three stages: 50k iterations for stage 1, 50k iterations for stage 2, and 10k iterations for stage 3. We use a 2×10⁻⁴ learning rate for the first stage, and 3×10⁻⁵ for later stages. All stages use a linear learning rate scheduler with 7,500 warm-up steps. For all experiments, we use the DDIM inference algorithm (Song et al., 2020) with 50 inference steps. To enable classifier-free guidance, we construct negative conditions by substituting every control token with t_null. We set the classifier-free guidance scale to 9.
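Collected into one place, the quoted setup reads as the configuration sketch below. The dataclass and the guidance helper are illustrative assumptions about how these numbers fit together, not code from the paper.

```python
# Reported Boximator training/inference settings gathered into a config sketch.
# Only the numeric values come from the paper; the structure itself is assumed.
from dataclasses import dataclass

@dataclass
class BoximatorTrainingConfig:
    num_frames: int = 16                     # 16-frame sequences
    resolution: int = 256                    # 256x256 pixels
    fps: int = 4                             # 4 frames per second
    max_objects: int = 8                     # N = 8 box-controlled objects
    global_batch_size: int = 128             # 128 / 16 GPUs = 8 samples per GPU
    num_gpus: int = 16                       # NVIDIA Tesla A100
    stage_iterations: tuple = (50_000, 50_000, 10_000)  # stages 1-3
    stage_learning_rates: tuple = (2e-4, 3e-5, 3e-5)    # stage 1, then stages 2-3
    warmup_steps: int = 7_500                # linear learning-rate warm-up per stage
    ddim_steps: int = 50                     # DDIM inference steps
    cfg_scale: float = 9.0                   # classifier-free guidance scale

def classifier_free_guidance(eps_cond, eps_uncond, scale: float = 9.0):
    """Standard CFG combination of conditional and unconditional noise predictions;
    the unconditional branch corresponds to replacing every control token with t_null."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```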