Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction
Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new baseline for the task of TVP. Our method, termed FCA, improves on previous work both quantitatively and qualitatively on standard benchmark datasets. Specifically, we achieve a 40% reduction (relative) in the FVD metric on both Something-Something-V2 (Goyal et al., 2017) and Epic Kitchen-100 (Damen et al., 2020) and an impressive 60% FVD reduction (relative) on Bridge Data (Ebert et al., 2021). |
| Researcher Affiliation | Academia | Zheyuan Liu EMAIL Center for Augmented Reasoning, Australian Institute for Machine Learning, University of Adelaide |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1, 2, 3) and mathematical equations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Our code is open-source at https://github.com/Cuberick-Orion/FCA. |
| Open Datasets | Yes | We follow Gu et al. (2023) and test on three text-to-video datasets. Something-Something V2 (SSv2) (Goyal et al., 2017) contains 168,913 training samples on daily human behaviors of interacting with common objects. Bridge-Data-V1 (Ebert et al., 2021) includes 20,066 training samples in kitchen environments where robot arms manipulate objects. Epic-Kitchen-100 (Epic100) (Damen et al., 2020) consists of 67,217 training samples of human actions in first-person (egocentric) views. |
| Dataset Splits | Yes | For SSv2, we select the first 2,048 samples in the validation set; for Bridge Data, we split the dataset by 80% and 20% and use the latter for validation; finally, for Epic100, we directly adopt its validation split. |
| Hardware Specification | Yes | We conduct our experiments on four NVIDIA A100 40G GPUs. In practice, we observe the VRAM usage at approximately 40GiB per GPU. |
| Software Dependencies | No | The paper mentions utilizing components like 'diffusers pipeline', 'T5 text encoder', 'ViT vision encoder', and 'CLIP text tokenizer' but does not specify their version numbers or other crucial software dependencies (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | We train our model for 100,000 steps on SSv2, 25,000 steps on Bridge Data, and 50,000 steps on Epic100. In inference, we set the CFG guidance scale = 6.0 and a sampling step of 50. Table 5: Hyper-parameters for fine-tuning. Height 480 Width 720 Number of frames 16 ... Batch size per GPU 1 Number of GPUs 4 Gradient accumulation 2 Effective batch size 8 ... Learning rate 0.001 Optimizer AdamW (β1 = 0.9, β2 = 0.95) Gradient clip 1.0 Precision bfloat16. |
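As a sanity check on the figures quoted in the table, the effective batch size and the Bridge Data validation-split size follow directly from the reported numbers. This is a minimal sketch; the exact rounding used for the 80%/20% split is an assumption, as the paper only states the percentages:

```python
# Effective batch size = per-GPU batch x number of GPUs x gradient-accumulation steps
per_gpu_batch = 1
num_gpus = 4
grad_accum = 2
effective_batch = per_gpu_batch * num_gpus * grad_accum  # 8, matching Table 5

# Bridge Data: 20,066 training samples, split 80%/20%, the latter used for validation.
# Floor rounding of the 80% training portion is assumed here, not stated in the paper.
bridge_total = 20_066
train_size = int(bridge_total * 0.8)
val_size = bridge_total - train_size

print(effective_batch, train_size, val_size)
```

Under this assumed rounding, the validation split would hold roughly 4,000 samples, consistent with the 2,048-sample SSv2 validation subset being of a similar order.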