Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Authors: Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines." and "This section presents the examples generated by existing long video generation methods including FIFO-Diffusion, and evaluates their performance qualitatively and quantitatively."
Researcher Affiliation | Academia | Jihwan Kim (1), Junoh Kang (1), Jinyoung Choi (1), Bohyung Han (1, 2). Computer Vision Laboratory, (1) ECE and (2) IPAI, Seoul National University. EMAIL
Pseudocode | Yes | "Algorithm 1 FIFO-Diffusion with diagonal denoising (Section 3.1)" ... "Algorithm 4 FIFO-Diffusion with lookahead denoising (Section 3.3)"
Open Source Code | Yes | "Generated video examples and source codes are available at our project page." Footnote 1: https://jjihwan.github.io/projects/FIFO-Diffusion
Open Datasets | Yes | "For quantitative evaluation, we measure FVD128 [27] and IS [21] scores using Latte [13] as a base model, which is a DiT-based video model trained on UCF-101 [26]."
Dataset Splits | No | The paper uses pretrained models and describes the generation of videos for evaluation (e.g., 'generate 2,048 videos with 128 frames each') but does not specify explicit train/validation/test dataset splits for its experiments.
Hardware Specification | Yes | "We adopt VideoCrafter2 as the baseline model, using a DDPM scheduler with 64 inference steps on A6000 GPUs."
Software Dependencies | No | The paper mentions software such as VideoCrafter1, VideoCrafter2, zeroscope, Open-Sora Plan, LaVie, and SEINE, and uses DDIM sampling, but does not provide version numbers for these components or libraries.
Experiment Setup | Yes | "We employ the DDIM sampling [24] with η ∈ {0.5, 1}." and "We empirically choose n = 4 for the number of partitions in latent partitioning and lookahead denoising." and Table 4, which lists parameters such as f, n, η, # Prompts, # Frames, and Resolution for the various experiments.