Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new baseline for the task of TVP. Our method, termed FCA, improves on previous work both quantitatively and qualitatively on standard benchmark datasets. Specifically, we achieve a 40% reduction (relative) in the FVD metric on both Something-Something-V2 (Goyal et al., 2017) and Epic Kitchen-100 (Damen et al., 2020) and an impressive 60% FVD reduction (relative) on Bridge Data (Ebert et al., 2021).
Researcher Affiliation Academia Zheyuan Liu EMAIL Center for Augmented Reasoning, Australian Institute for Machine Learning, University of Adelaide
Pseudocode No The paper includes architectural diagrams (Figure 1, 2, 3) and mathematical equations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code Yes Our code is open-source at https://github.com/Cuberick-Orion/FCA.
Open Datasets Yes We follow Gu et al. (2023) and test on three text-to-video datasets. Something-Something V2 (SSv2) (Goyal et al., 2017) contains 168,913 training samples on daily human behaviors of interacting with common objects. Bridge-Data-V1 (Ebert et al., 2021) includes 20,066 training samples in kitchen environments where robot arms manipulate objects. Epic-Kitchen-100 (Epic100) (Damen et al., 2020) consists of 67,217 training samples of human actions in first-person (egocentric) views.
Dataset Splits Yes For SSv2, we select the first 2,048 samples in the validation set; for Bridge Data, we split the dataset by 80% and 20% and use the latter for validation; finally, for Epic100, we directly adopt its validation split.
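The split procedure quoted above can be sketched in a few lines; the paper does not publish its splitting code, so the function names, the seed, and the use of a shuffled (rather than sequential) 80/20 split for Bridge Data are illustrative assumptions:

```python
import random

def split_80_20(samples, seed=0):
    """Bridge Data: hold out 20% of samples for validation.
    Whether the paper shuffled before splitting is not stated; the
    shuffle and seed here are assumptions for reproducibility."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(0.8 * len(samples))
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val

def first_k_validation(val_set, k=2048):
    """SSv2: evaluate on the first 2,048 samples of the official validation set."""
    return val_set[:k]
```

For Epic100 no code is needed, since the report quotes the paper as adopting the dataset's own validation split directly.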
Hardware Specification Yes We conduct our experiments on four NVIDIA A100 40G GPUs. In practice, we observe the VRAM usage at approximately 40GiB per GPU.
Software Dependencies No The paper mentions utilizing components like the 'diffusers pipeline', 'T5 text encoder', 'ViT vision encoder', and 'CLIP text tokenizer' but does not specify their version numbers or other crucial software dependencies (e.g., Python version, PyTorch version).
Experiment Setup Yes We train our model for 100,000 steps on SSv2, 25,000 steps on Bridge Data, and 50,000 steps on Epic100. In inference, we set the CFG guidance scale = 6.0 and a sampling step of 50. Table 5: Hyper-parameters for fine-tuning. Height 480; Width 720; Number of frames 16; ... Batch size per GPU 1; Number of GPUs 4; Gradient accumulation 2; Effective batch size 8; ... Learning rate 0.001; Optimizer AdamW (β1 = 0.9, β2 = 0.95); Gradient clip 1.0; Precision bfloat16.
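The "Effective batch size 8" entry in Table 5 is the product of the per-GPU batch size, GPU count, and gradient-accumulation steps; a minimal sketch of that arithmetic (the helper name is illustrative, not from the paper):

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int) -> int:
    """Number of samples contributing to each optimizer step when
    micro-batches are accumulated across GPUs before updating weights."""
    return per_gpu_batch * num_gpus * grad_accum

# Values reported in Table 5: 1 sample/GPU x 4 GPUs x 2 accumulation steps = 8
assert effective_batch_size(1, 4, 2) == 8
```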