FIFO-Diffusion: Generating Infinite Videos from Text without Training

Authors: Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines." and "This section presents the examples generated by existing long video generation methods including FIFO-Diffusion, and evaluates their performance qualitatively and quantitatively."
Researcher Affiliation | Academia | Jihwan Kim¹, Junoh Kang¹, Jinyoung Choi¹, Bohyung Han¹,²; Computer Vision Laboratory, ¹ECE & ²IPAI, Seoul National University; {kjh26720, junoh.kang, jin0.choi, bhhan}@snu.ac.kr
Pseudocode | Yes | Algorithm 1 FIFO-Diffusion with diagonal denoising (Section 3.1) ... Algorithm 4 FIFO-Diffusion with lookahead denoising (Section 3.3)
Open Source Code | Yes | "Generated video examples and source codes are available at our project page (https://jjihwan.github.io/projects/FIFO-Diffusion)."
Open Datasets | Yes | "For quantitative evaluation, we measure FVD128 [27] and IS [21] scores using Latte [13] as a base model, which is a DiT-based video model trained on UCF-101 [26]."
Dataset Splits | No | The paper uses pretrained models and describes the generation of videos for evaluation (e.g., "generate 2,048 videos with 128 frames each") but does not specify explicit train/validation/test dataset splits for its experiments.
Hardware Specification | Yes | "We adopt VideoCrafter2 as the baseline model, using a DDPM scheduler with 64 inference steps on A6000 GPUs."
Software Dependencies | No | The paper mentions software such as VideoCrafter1, VideoCrafter2, zeroscope, Open-Sora Plan, LaVie, and SEINE, and uses DDIM sampling, but does not provide version numbers for these components or libraries.
Experiment Setup | Yes | "We employ the DDIM sampling [24] with η ∈ {0.5, 1}." and "We empirically choose n = 4 for the number of partitions in latent partitioning and lookahead denoising." Table 4 lists parameters such as f, n, η, # Prompts, # Frames, and Resolution for the various experiments.
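
For readers unfamiliar with the η parameter in the Experiment Setup entry, the following is a minimal sketch of how η scales the noise injected at each step in standard DDIM sampling. It is not taken from the FIFO-Diffusion code: the function name `ddim_sigma` and the cumulative-alpha values are illustrative assumptions, and only the general DDIM variance formula is used.

```python
import math


def ddim_sigma(alpha_bar_t: float, alpha_bar_prev: float, eta: float) -> float:
    """Noise standard deviation for one DDIM step from timestep t to t-1.

    alpha_bar_t and alpha_bar_prev are cumulative products of the noise
    schedule at the current and previous (less noisy) timesteps, so
    alpha_bar_t < alpha_bar_prev. eta = 0 gives deterministic sampling;
    eta = 1 gives the DDPM-like stochastic variance.
    """
    variance_ratio = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    step_term = 1.0 - alpha_bar_t / alpha_bar_prev
    return eta * math.sqrt(variance_ratio) * math.sqrt(step_term)


if __name__ == "__main__":
    # Hypothetical schedule values, chosen only for illustration.
    alpha_bar_t, alpha_bar_prev = 0.5, 0.8
    for eta in (0.0, 0.5, 1.0):
        sigma = ddim_sigma(alpha_bar_t, alpha_bar_prev, eta)
        print(f"eta={eta}: sigma={sigma:.4f}")
```

Under this formula the injected noise grows linearly with η, so of the two values the paper reports, η = 0.5 injects half the per-step noise of η = 1.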