Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Enhancing Consistency of Flow-Based Image Editing through Kalman Control
Authors: Haozhe Chi, Zhicheng Sun, Yang Jin, Yi Ma, Jing Wang, Yadong Mu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several datasets demonstrate its superior performance compared to previous state-of-the-art methods. 4 Experiments 4.1 Evaluation protocols Datasets. The experimental evaluation is conducted across four widely used datasets: SFHQ [3], HQ [19], ZONE [29] and DIV2K [1]. Metrics. Six metrics are employed to evaluate both editing quality and consistency. For editing quality, CLIP-T [40] is adopted to measure the semantic adherence between edited image and input prompts. Meanwhile, we also use Face Rec. metric to quantify identity similarity on the face-specific SFHQ dataset. Regarding editing consistency, CLIP-I and DINO [7] measure high-level semantic similarity, while LPIPS [57] captures low-level similarity such as pixel-level details. Moreover, Dreamsim [13] is responsible for evaluating mid-level similarity, including image layout. Quantitative results. As demonstrated in Table 1 and Table 2, our method maintains high facial similarity after editing on the SFHQ dataset and effectively adheres to complex editing prompts on the HQ dataset. Ablation analysis. In this section, we conduct ablation experiments to determine the optimal filter strength and steps at which the filter is applied, as well as to highlight the importance of Kalman control in structural preservation. |
| Researcher Affiliation | Collaboration | Haozhe Chi 1 Zhicheng Sun 1 Yang Jin 1 Yi Ma 2 Jing Wang 2 Yadong Mu 1 1Peking University, 2Central Media Technology Institute, Huawei |
| Pseudocode | Yes | Algorithm 1 Kalman-Edit and Kalman-Edit Algorithm 2 Detailed procedure of Kalman-Edit |
| Open Source Code | Yes | To facilitate further open research into its practical uses and any potential societal impacts, our code would be open sourced at https://github.com/anonymous-138384/Kalman-Edit-Pytorch/. |
| Open Datasets | Yes | The experimental evaluation is conducted across four widely used datasets: SFHQ [3], HQ [19], ZONE [29] and DIV2K [1]. |
| Dataset Splits | No | Due to practical computational constraints, we evaluate our approach on the subsets of these benchmarks, including 1,200 images from SFHQ, 320 images from HQ, and 105 images from ZONE and DIV2K. |
| Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA A40 GPU. |
| Software Dependencies | Yes | For flow-based editing, we use FLUX.1 dev [4] with N = 28 sampling steps. For diffusion-based editing, we use Stable Diffusion 1.4 [41] with N = 50 sampling steps. |
| Experiment Setup | Yes | Implementation details. For flow-based editing, we use FLUX.1 dev [4] with N = 28 sampling steps. For diffusion-based editing, we use Stable Diffusion 1.4 [41] with N = 50 sampling steps. The measurement length l is set to 14, and δ is 6 by default. More details are provided in Appendix B. Table 6: Experiment hyperparameters (columns: Steps, Base model, CFG scale, Control strength; rows with fewer values leave a column blank). SDEdit: 50, SD 1.4, 4.0; P2P: 50, SD 1.4, 7.5; MasaCtrl: 50, SD 1.4, 7.5; DDPM-Inv: 50, SD 1.4, 9; RF-Edit: 28, FLUX.1 dev, 2; RF-Inversion: 28, FLUX.1 dev, 3.5, (0.7, 0.95); FlowEdit: 28, FLUX.1 dev, (1.5, 5.5); FlowChef: 28, FLUX.1 dev, 2; Ours: 28, FLUX.1 dev, 3.5, 0.95. Next, we explain the hyperparameter settings for our proposed method. We set the steps to add the Kalman filter l to 14, which is half of the total steps. And we set the steps L to be the later half steps of the generation (i.e., steps 15 to 28). The hyperparameter δ determining the two types of measurement sequences is 6 by default. The coefficient hyperparameters µ and λ are set to 0.7 and 0.1 by default. |
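
The defaults quoted in the Experiment Setup row can be collected in a small configuration object. The following is a minimal sketch, not the authors' code: the class name `KalmanEditConfig`, its field names, and the `filter_steps` helper are hypothetical, and only the numeric values come from the quoted text. It shows how the 14 filtered steps map onto steps 15 to 28 of a 28-step schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KalmanEditConfig:
    """Hypothetical container for the quoted defaults; not the authors' API."""

    base_model: str = "FLUX.1 dev"
    num_steps: int = 28            # N, sampling steps for flow-based editing
    cfg_scale: float = 3.5         # "Ours" row of Table 6
    control_strength: float = 0.95 # "Ours" row of Table 6
    measurement_length: int = 14   # l, number of steps with the Kalman filter
    delta: int = 6                 # selects between the two measurement sequences
    mu: float = 0.7                # coefficient hyperparameter
    lam: float = 0.1               # coefficient hyperparameter (lambda)

    def filter_steps(self) -> range:
        """Return the 1-indexed steps that use the filter: the last l steps."""
        start = self.num_steps - self.measurement_length + 1
        return range(start, self.num_steps + 1)


config = KalmanEditConfig()
steps = config.filter_steps()
print(f"Filter applied on steps {steps[0]} to {steps[-1]} ({len(steps)} steps)")
```

With the defaults, the helper yields steps 15 through 28, matching the "later half" described in the quoted appendix text. How δ, µ, and λ enter the update is not specified in this summary, so the sketch only stores them.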