TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate state-of-the-art editing results on a variety of real-world videos. We evaluate our method on DAVIS videos (Pont-Tuset et al., 2017) and on Internet videos depicting animals, food, humans, and various objects in motion. The spatial resolution of the videos is 384×672 or 512×512 pixels, and they consist of 40 to 200 frames. We use various text prompts on each video to obtain diverse editing results. Our evaluation dataset comprises 61 text-video pairs. We utilize PnP-Diffusion (Tumanyan et al., 2023) as the frame editing method, and we use the same hyper-parameters for all our results. |
| Researcher Affiliation | Academia | The paper explicitly states "Anonymous authors. Paper under double-blind review.", which means no affiliation information is provided in order to maintain anonymity during the review process. |
| Pseudocode | Yes | Algorithm 1: TokenFlow editing (an illustrative sketch of its token-propagation step is given below the table). |
| Open Source Code | No | The paper mentions "we refer the reader to our webpage and SM for more examples and full-video results" and "We refer the reader to the HTML file attached to our Supplementary Material for video results." but does not explicitly state that the source code for their method is released or provide a link to a repository for their implementation. |
| Open Datasets | Yes | We evaluate our method on DAVIS videos (Pont-Tuset et al., 2017) and on Internet videos depicting animals, food, humans, and various objects in motion. |
| Dataset Splits | No | The paper states: "Our evaluation dataset comprises of 61 text-video pairs." and "We evaluate our method on DAVIS videos... and on Internet videos...", but it does not provide specific details on how these videos are partitioned into training, validation, or test splits. The method leverages a pre-trained model and does not involve training on these specific video datasets. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU models, CPU models, or memory used for running the experiments. It only lists runtimes in Table 3 without hardware context. |
| Software Dependencies | No | The paper states: "We use Stable Diffusion as our pre-trained text-to-image model; we use the Stable Diffusion-v-2-1 checkpoint provided via official Hugging Face webpage." However, it does not list other ancillary software dependencies, such as the programming language or deep learning framework, with specific version numbers (e.g., Python, PyTorch, CUDA versions). A hedged loading sketch for this checkpoint is given below the table. |
| Experiment Setup | Yes | In all of our experiments, we use DDIM deterministic sampling with 50 steps. For inverting the video, we follow Tumanyan et al. (2023) and use DDIM inversion with a classifier-free guidance scale of 1 and 1000 forward steps; for sampling the edited video we set the classifier-free guidance scale to 7.5. At each timestep, we sample random keyframes in frame intervals of 8. (These values are collected into a configuration sketch below the table.) |
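
The "Pseudocode" row refers to Algorithm 1 (TokenFlow editing), whose core idea is to edit a small set of keyframes jointly and then propagate their edited diffusion features ("tokens") to all remaining frames along nearest-neighbour correspondences computed on the original video's features. The following is a minimal, illustrative PyTorch sketch of that propagation step only; the function name, tensor shapes, and cosine-similarity matching are our own simplifications, not the authors' implementation.

```python
# Hypothetical sketch of TokenFlow-style feature propagation (not the authors' code).
import torch
import torch.nn.functional as F


def propagate_tokens(src_frame_tok, src_key_tok, edited_key_tok, alpha):
    """Blend edited keyframe tokens into the current frame.

    src_frame_tok:   (N, C)    source-video tokens of the current frame
    src_key_tok:     (2, N, C) source-video tokens of the two nearest keyframes
    edited_key_tok:  (2, N, C) tokens of those keyframes after editing
    alpha:           relative temporal distance to the second keyframe, in [0, 1]
    """
    blended = []
    frame = F.normalize(src_frame_tok, dim=-1)
    for k in range(2):
        key = F.normalize(src_key_tok[k], dim=-1)
        # nearest neighbour of each frame token among the keyframe's SOURCE tokens
        nn_idx = (frame @ key.T).argmax(dim=-1)            # (N,)
        # pull the corresponding EDITED keyframe tokens
        blended.append(edited_key_tok[k][nn_idx])          # (N, C)
    # linear blend between the two keyframes according to temporal distance
    return (1.0 - alpha) * blended[0] + alpha * blended[1]


# toy usage with random features: N tokens of dimension C per frame
N, C = 32 * 32, 320
out = propagate_tokens(torch.randn(N, C), torch.randn(2, N, C),
                       torch.randn(2, N, C), alpha=0.5)
print(out.shape)  # torch.Size([1024, 320])
```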
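
For the "Software Dependencies" row, the only concrete dependency the paper names is the Stable Diffusion v2-1 checkpoint from Hugging Face. A minimal loading sketch using the `diffusers` library is shown below; the choice of library, the model identifier string, and the dtype/device settings are assumptions on our part, since the paper does not specify them.

```python
# Assumed setup: loading the Stable Diffusion v2-1 checkpoint via Hugging Face diffusers.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-1"  # public checkpoint referenced by the paper
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# the paper uses deterministic DDIM sampling, so swap in a DDIM scheduler
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
```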
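
For the "Experiment Setup" row, the stated hyper-parameters can be gathered into one configuration block for reproduction attempts. The values are those quoted from the paper; the key names are our own, hypothetical naming.

```python
# Hyper-parameters quoted in the "Experiment Setup" row (key names are ours).
EXPERIMENT_SETUP = {
    "sampler": "DDIM (deterministic)",
    "num_sampling_steps": 50,
    "inversion": {
        "method": "DDIM inversion, following Tumanyan et al. (2023)",
        "classifier_free_guidance_scale": 1.0,
        "num_forward_steps": 1000,
    },
    "editing": {
        "classifier_free_guidance_scale": 7.5,
        "keyframe_interval": 8,  # random keyframes sampled every 8 frames at each timestep
    },
    "frame_editing_method": "PnP-Diffusion (Tumanyan et al., 2023)",
}
```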