FIFO-Diffusion: Generating Infinite Videos from Text without Training
Authors: Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. ... This section presents the examples generated by existing long video generation methods including FIFO-Diffusion, and evaluates their performance qualitatively and quantitatively. |
| Researcher Affiliation | Academia | Jihwan Kim¹, Junoh Kang¹, Jinyoung Choi¹, Bohyung Han¹,²; Computer Vision Laboratory, ¹ECE & ²IPAI, Seoul National University; {kjh26720, junoh.kang, jin0.choi, bhhan}@snu.ac.kr |
| Pseudocode | Yes | Algorithm 1 FIFO-Diffusion with diagonal denoising (Section 3.1) ... Algorithm 4 FIFO-Diffusion with lookahead denoising (Section 3.3) |
| Open Source Code | Yes | Generated video examples and source codes are available at our project page (https://jjihwan.github.io/projects/FIFO-Diffusion). |
| Open Datasets | Yes | For quantitative evaluation, we measure FVD128 [27] and IS [21] scores using Latte [13] as a base model, which is a DiT-based video model trained on UCF-101 [26]. |
| Dataset Splits | No | The paper uses pretrained models and describes the generation of videos for evaluation (e.g., 'generate 2,048 videos with 128 frames each') but does not specify explicit train/validation/test dataset splits used in their experiments. |
| Hardware Specification | Yes | We adopt VideoCrafter2 as the baseline model, using a DDPM scheduler with 64 inference steps on A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like VideoCrafter1, VideoCrafter2, zeroscope, Open-Sora Plan, LaVie, and SEINE, and uses DDIM sampling, but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | We employ the DDIM sampling [24] with η ∈ {0.5, 1}. ... We empirically choose n = 4 for the number of partitions in latent partitioning and lookahead denoising. ... Table 4 lists specific parameters such as f, n, η, # Prompts, # Frames, and Resolution for the various experiments. |
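
The Pseudocode row refers to a first-in-first-out scheme with diagonal denoising. As a rough illustration only, the sketch below tracks the queue bookkeeping such a scheme implies: each slot holds a frame at a different noise level, every step lowers all levels by one, the clean head frame is dequeued, and a fresh pure-noise frame is enqueued at the tail. This is not the paper's Algorithm 1. The names `Slot` and `fifo_schedule` are hypothetical, and the denoiser call is replaced by a simple counter decrement.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Slot:
    frame_id: int
    noise_level: int  # 0 = clean, f = pure noise (toy convention, an assumption)


def fifo_schedule(f: int, num_outputs: int):
    """Simulate queue bookkeeping only; no video model is involved."""
    # Start with slots at noise levels 1..f, so the head is the least noisy frame.
    queue = deque(Slot(frame_id=i, noise_level=i + 1) for i in range(f))
    next_id = f
    finished = []
    while len(finished) < num_outputs:
        # One step: every slot moves down one noise level.
        # A real system would call a denoiser here (placeholder decrement).
        for slot in queue:
            slot.noise_level -= 1
        # The head is now clean and leaves the queue.
        head = queue.popleft()
        assert head.noise_level == 0
        finished.append(head.frame_id)
        # A new frame of pure noise joins at the tail.
        queue.append(Slot(frame_id=next_id, noise_level=f))
        next_id += 1
    return finished, [s.noise_level for s in queue]


finished, levels = fifo_schedule(f=4, num_outputs=6)
print(finished)  # frame ids in the order they left the queue
print(levels)    # noise levels left in the queue after the last step
```

Because one frame leaves and one enters per step, the queue length stays fixed at `f` however many frames are produced, which is the property a scheme of this kind relies on for generating arbitrarily long videos.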