Boosting Text-to-Video Generative Model with MLLMs Feedback

Authors: Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, Furu Wei

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments confirm the effectiveness of both VIDEOPREFER and VIDEORM, representing a significant step forward in the field.
Researcher Affiliation | Collaboration | Xun Wu (1), Shaohan Huang (1), Guolong Wang (2), Jing Xiong (3), Furu Wei (1); 1: Microsoft Research Asia, 2: University of International Business and Economics, 3: The University of Hong Kong
Pseudocode | Yes | Algorithm 1 DRaFT-V: Reward Reinforcement Learning for Fine-tuning Text-to-Video Models with VIDEORM (a hedged sketch of the general idea follows the table)
Open Source Code | No | We will make our data and code public upon paper acceptance, due to the management regulations of our institution.
Open Datasets | Yes | VIDEOPREFER, which includes 135,000 preference annotations. Utilizing this dataset, we introduce VIDEORM, the first general-purpose reward model tailored for video preference in the text-to-video domain (see the reward-model sketch after the table).
Dataset Splits | No | The paper does not provide explicit training/validation/test split percentages or sample counts for its own VIDEOPREFER dataset, nor for how it samples from the mixture of existing datasets used in training.
Hardware Specification | Yes | All VIDEORM series models are trained in half-precision on 8 32GB NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch and CLIP but does not provide specific version numbers for these or any other key software components.
Experiment Setup | Yes | All VIDEORM series models are trained in half-precision on 8 32GB NVIDIA V100 GPUs, with a learning rate of 1e-5 and a batch size of 64 in total. We set the number of input frames N = 8. (These values are collected in the configuration sketch after the table.)
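
The Pseudocode row names Algorithm 1, DRaFT-V, a DRaFT-style reward fine-tuning procedure that uses VIDEORM as the reward signal. The paper's algorithm is not reproduced here; the following is a minimal sketch of the general idea (backpropagating a differentiable reward through only the last K denoising steps), in which `video_pipe`, `reward_model`, and their method names are hypothetical placeholders rather than the authors' API.

```python
# Minimal sketch of DRaFT-style reward fine-tuning for a text-to-video model.
# Assumptions (not from the paper): `video_pipe` is a latent video diffusion
# pipeline whose final K denoising steps can be run with gradients enabled, and
# `reward_model` is a differentiable VIDEORM-like scorer for (prompt, video).
import torch


def draft_v_step(video_pipe, reward_model, prompts, optimizer, k_last_steps=1):
    """One reward-ascent update; returns the mean reward for logging."""
    # Run most of the sampling chain without gradients to save memory.
    with torch.no_grad():
        latents = video_pipe.sample_to_step(prompts, stop_before_last=k_last_steps)

    # Re-enter the last K denoising steps with gradients enabled so the
    # reward signal can flow back into the generator's weights.
    videos = video_pipe.denoise_last_steps(latents, prompts, k=k_last_steps)

    # Gradient ascent on the reward (descent on its negation).
    reward = reward_model(prompts, videos).mean()
    (-reward).backward()
    optimizer.step()
    optimizer.zero_grad()
    return reward.item()
```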
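The Open Datasets row states that VIDEORM is trained on the 135,000 preference annotations in VIDEOPREFER, but the table does not spell out the training objective. A standard choice for preference reward models is a Bradley-Terry pairwise loss, sketched below under the assumption that annotations reduce to (preferred, rejected) video pairs per prompt; the `reward_model` interface is hypothetical.

```python
# Hedged sketch of a pairwise preference objective for a VIDEORM-like reward
# model. Assumption: `reward_model(prompts, videos)` returns one scalar score
# per (prompt, video) pair in the batch.
import torch.nn.functional as F


def preference_loss(reward_model, prompts, videos_preferred, videos_rejected):
    r_pos = reward_model(prompts, videos_preferred)  # shape: (batch,)
    r_neg = reward_model(prompts, videos_rejected)   # shape: (batch,)
    # Bradley-Terry / logistic loss: push preferred scores above rejected ones.
    return -F.logsigmoid(r_pos - r_neg).mean()
```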
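The Hardware Specification and Experiment Setup rows give the only quantitative training details: half precision on 8 32GB V100 GPUs, a learning rate of 1e-5, a total batch size of 64, and N = 8 input frames. The configuration sketch below simply collects those reported values; the optimizer and the per-GPU batch split are assumptions not stated in the table.

```python
# Reported VIDEORM training setup gathered into one place. Fields marked
# "reported" come from the table above; everything else is an assumption.
from dataclasses import dataclass


@dataclass
class VideoRMTrainConfig:
    num_gpus: int = 8              # reported: 8 x 32GB NVIDIA V100
    precision: str = "fp16"        # reported: half-precision training
    learning_rate: float = 1e-5    # reported
    total_batch_size: int = 64     # reported: batch size of 64 in total
    num_input_frames: int = 8      # reported: N = 8
    per_gpu_batch_size: int = 8    # assumption: 64 split evenly over 8 GPUs
    optimizer: str = "AdamW"       # assumption: not stated in the excerpt
```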