Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FSI-Edit: Frequency and Stochasticity Injection for Flexible Diffusion-Based Image Editing

Authors: Kaixiang Yang, Xin Li, Yuxi Li, Qiang Li, Zhiwei Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To comprehensively evaluate our method on both rigid and non-rigid editing tasks, we conduct experiments on the PIE-Bench [15] benchmark, which contains 700 image-prompt pairs across 10 diverse editing categories. For assessing non-rigid editing performance specifically, such as object addition, deletion, and pose modification, we additionally curate a subset of 300 samples from PIE-Bench that emphasize such editing. Our comparisons include LDM-based methods (P2P [8], PnP [9], MasaCtrl [19], FlexiEdit [16], FreeDiff [34]) and DiT-based approaches (RFInv [39], Stable Flow [40], RF-Edit [31], DCEdit [33]). All models are tested using their publicly available implementations and default configurations for fair comparison. Metrics. To holistically assess editing performance and background preservation of different methods, we employ six complementary metrics. Structure Distance [41] measures the structural similarity between edited images and original images, while PSNR, LPIPS [42], MSE and SSIM [43] collectively evaluate content preservation in unedited regions. For text-image consistency, we compute CLIP similarity [44] over both the entire image and the edited region. The dataset-provided masks are used to identify the edited regions, but only during evaluation. Implementation Details. All experiments for FSI-Edit-LDM were conducted using Latent Diffusion Model (LDM) [5] v1.5, while FSI-Edit-DiT was built upon DiT [7] v3.5-Medium. We use 50 DDIM steps for inversion, with a CFG scale of 1. During generation, the target branch uses a CFG scale of 7.5. The same settings are applied across both backbones. All experiments were run on a single NVIDIA RTX 4090 GPU with 17 GB memory usage. The full editing pipeline, including inversion and generation, takes 20 seconds per image. Our code is available at https://github.com/kk42yy/FSI-Edit.
4.2 Comparisons on Diverse Editing Types Experimental results on the PIE-Bench are summarized in Table 1. Visual comparisons are shown in Figure 4 and Figure 5. These results are generated using our DiT-based version of FSI-Edit; additional examples and results for the LDM-based variant can be found in the Appendix. 4.5 Ablation Study Effects of Key Components. To assess the contribution of each core component in our method, we conduct ablation studies on the curated non-rigid editing subset of PIE-Bench.
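The quoted metrics passage says that MSE and PSNR are evaluated on unedited regions, using the dataset-provided masks. A minimal sketch of that idea follows, assuming images flattened to lists of intensities in [0, 1] and a mask where 1 marks an edited pixel. The function names and the toy data are hypothetical and are not taken from the FSI-Edit repository or from PIE-Bench.

```python
import math
from typing import Sequence


def masked_mse(original: Sequence[float], edited: Sequence[float],
               edit_mask: Sequence[int]) -> float:
    """Mean squared error over pixels where edit_mask == 0 (unedited region)."""
    if not (len(original) == len(edited) == len(edit_mask)):
        raise ValueError("inputs must have the same length")
    squared_errors = [
        (o - e) ** 2
        for o, e, m in zip(original, edited, edit_mask)
        if m == 0
    ]
    if not squared_errors:
        raise ValueError("mask marks every pixel as edited")
    return sum(squared_errors) / len(squared_errors)


def psnr_from_mse(mse: float, max_value: float = 1.0) -> float:
    """PSNR in dB: 10 * log10(max_value^2 / mse); infinite when mse is zero."""
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 1-D "image" (assumed data); the last two pixels belong to the edited region.
original = [0.0, 0.5, 1.0, 0.2]
edited = [0.1, 0.5, 0.0, 0.9]
edit_mask = [0, 0, 1, 1]

background_mse = masked_mse(original, edited, edit_mask)
background_psnr = psnr_from_mse(background_mse)
print(f"background MSE={background_mse:.4f}, PSNR={background_psnr:.2f} dB")
```

Large changes inside the masked region do not affect either value, which is why these metrics isolate background preservation from edit strength.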
Researcher Affiliation Academia Kaixiang Yang, Xin Li, Yuxi Li, Qiang Li, Zhiwei Wang. Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology.
Pseudocode Yes
Algorithm 1 FSI-Edit-LDM
Input: origin image x_0, inversion steps T, denoising model ε_θ, source and target prompts P_src, P_tgt, res-block and self-attention thresholds τ_res and τ_self
Stage I: DDIM Inversion
for t = 1, …, T do
    x_t = αt αt 1 x_{t-1} + (1 αt 1) αt αt 1 ε_θ(x_{t-1}, t-1, P_src)
end for
Get the inversion trajectory {x_t}, t = 1, …, T
Stage II: FSI Editing
x^tar_T = ITN(x_T, x_{T-1})
for t = T, …, 1 do
    x_t = ITN(x_t, x_{t-1})
    f^res_{t,src}, Q^self_{t,src}, K^self_{t,src}, V^self_{t,src} ← ε_θ(x_t, t, P_tgt)
    if t > τ_res then
        f^res_{t,tgt} = SNI FRF(f^res_{t,src}, f^res_{t,tgt})
    else
        f^res_{t,tgt} = f^res_{t,tgt}
    end if
    if t > τ_self then
        Q^self_{t,tgt}, K^self_{t,tgt} = SNI FRF(Q^self_{t,src}, K^self_{t,src}); V^self_{t,tgt} = V^self_{t,src}
    else
        Q^self_{t,tgt}, K^self_{t,tgt}, V^self_{t,tgt} = Q^self_{t,tgt}, K^self_{t,tgt}, V^self_{t,tgt}
    end if
    x^tar_{t-1} = ε_θ(x^tar_t, t, P_tgt; f^res_{t,tgt}, Q^self_{t,tgt}, K^self_{t,tgt}, V^self_{t,tgt})
    x^tar_{t-1} = DDIM-Samp(x^tar_t, x^tar_{t-1})
end for
Output: Editing image x^tar_0
---
Algorithm 2 FSI-Edit-DiT
Input: origin image x_0, inversion steps T, velocity field v_θ, source and target prompts P_src, P_tgt, cross-block and self-attention thresholds τ_cross and τ_self
Stage I: Rectified Flow Inversion
for t = 1, …, T do
    x_t = x_{t-1} + (σ_t - σ_{t-1}) v_θ(x_{t-1}, t-1, P_src)
end for
Get the inversion trajectory {x_t}, t = 1, …, T
Stage II: FSI Editing
x^tar_T = ITN(x_T, x_{T-1})
for t = T, …, 1 do
    x_t = ITN(x_t, x_{t-1})
    (Q^cross_{t,src}, K^cross_{t,src}, V^cross_{t,src}), (Q^self_{t,src}, K^self_{t,src}, V^self_{t,src}) ← v_θ(x_t, t, P_tgt)
    if t > τ_cross then
        Q^cross_{t,tgt}, K^cross_{t,tgt}, V^cross_{t,tgt} = SNI(Q^cross_{t,src}, K^cross_{t,src}, V^cross_{t,src})
    else
        Q^cross_{t,tgt}, K^cross_{t,tgt}, V^cross_{t,tgt} = Q^cross_{t,tgt}, K^cross_{t,tgt}, V^cross_{t,tgt}
    end if
    if t > τ_self then
        Q^self_{t,tgt}, K^self_{t,tgt} = SNI FRF(Q^self_{t,src}, K^self_{t,src}); V^self_{t,tgt} = V^self_{t,src}
    else
        Q^self_{t,tgt}, K^self_{t,tgt}, V^self_{t,tgt} = Q^self_{t,tgt}, K^self_{t,tgt}, V^self_{t,tgt}
    end if
    x^tar_{t-1} = v_θ(x^tar_t, t, P_tgt; Q^cross_{t,tgt}, K^cross_{t,tgt}, V^cross_{t,tgt}, Q^self_{t,tgt}, K^self_{t,tgt}, V^self_{t,tgt})
    x^tar_{t-1} = Rectified Flow-Samp(x^tar_t, x^tar_{t-1})
end for
Output: Editing image x^tar_0
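Stage I of Algorithm 2 is an Euler-style update, x_t = x_{t-1} + (σ_t - σ_{t-1}) v_θ(x_{t-1}, t-1, P_src). The sketch below illustrates only that loop on plain Python lists. The constant `toy_velocity` function is a hypothetical stand-in for the learned velocity field, the sigma schedule is made up for the example, and none of the names come from the FSI-Edit code. Unlike the algorithm listing, the returned trajectory also includes x_0 at index 0 for convenience.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Vector, int, str], Vector]


def rectified_flow_inversion(x0: Sequence[float], sigmas: Sequence[float],
                             velocity: VelocityFn, source_prompt: str) -> List[Vector]:
    """Return [x_0, x_1, ..., x_T] where
    x_t = x_{t-1} + (sigma_t - sigma_{t-1}) * v(x_{t-1}, t-1, P_src)."""
    trajectory: List[Vector] = [list(x0)]
    for t in range(1, len(sigmas)):
        prev = trajectory[-1]
        v = velocity(prev, t - 1, source_prompt)
        step = sigmas[t] - sigmas[t - 1]
        trajectory.append([x + step * vi for x, vi in zip(prev, v)])
    return trajectory


def toy_velocity(x: Vector, t: int, prompt: str) -> Vector:
    # Hypothetical stand-in for v_theta: constant unit velocity per coordinate.
    return [1.0 for _ in x]


# Assumed schedule with T = 3 steps; a real model defines its own sigmas.
sigmas = [0.0, 0.25, 0.5, 1.0]
trajectory = rectified_flow_inversion([0.0, 2.0], sigmas, toy_velocity, "a photo of a cat")
print(trajectory[-1])
```

Keeping the whole trajectory, rather than only x_T, matters here because Stage II pairs neighbouring states through ITN(x_t, x_{t-1}).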
Open Source Code Yes Our code is available at https://github.com/kk42yy/FSI-Edit.
Open Datasets Yes To comprehensively evaluate our method on both rigid and non-rigid editing tasks, we conduct experiments on the PIE-Bench [15] benchmark, which contains 700 image-prompt pairs across 10 diverse editing categories. For assessing non-rigid editing performance specifically, such as object addition, deletion, and pose modification, we additionally curate a subset of 300 samples from PIE-Bench that emphasize such editing.
Dataset Splits No To comprehensively evaluate our method on both rigid and non-rigid editing tasks, we conduct experiments on the PIE-Bench [15] benchmark, which contains 700 image-prompt pairs across 10 diverse editing categories. For assessing non-rigid editing performance specifically, such as object addition, deletion, and pose modification, we additionally curate a subset of 300 samples from PIE-Bench that emphasize such editing. The paper mentions evaluating on a benchmark and a curated subset but does not specify explicit training/validation/test splits used for their method.
Hardware Specification Yes All experiments were run on a single NVIDIA RTX 4090 GPU with 17 GB memory usage. The full editing pipeline, including inversion and generation, takes 20 seconds per image. [...] All LDM-based methods above are implemented using the v1.4 or v1.5 Stable Diffusion backbone and are executed on a single NVIDIA RTX 4090 GPU with 24GB of memory. [...] The above three methods are executed on a single NVIDIA A100-PCIE-80GB GPU.
Software Dependencies No All experiments for FSI-Edit-LDM were conducted using Latent Diffusion Model (LDM) [5] v1.5, while FSI-Edit-DiT was built upon DiT [7] v3.5-Medium. The paper mentions specific versions of the diffusion models used as backbones (LDM v1.5, DiT v3.5-Medium), but it does not provide version numbers for ancillary software such as Python, PyTorch, or CUDA.
Experiment Setup Yes Implementation Details. All experiments for FSI-Edit-LDM were conducted using Latent Diffusion Model (LDM) [5] v1.5, while FSI-Edit-DiT was built upon DiT [7] v3.5-Medium. We use 50 DDIM steps for inversion, with a CFG scale of 1. During generation, the target branch uses a CFG scale of 7.5. The same settings are applied across both backbones. All experiments were run on a single NVIDIA RTX 4090 GPU with 17 GB memory usage. The full editing pipeline, including inversion and generation, takes 20 seconds per image. Our code is available at https://github.com/kk42yy/FSI-Edit. [...] Specifically, we set the default values as follows: pairing distance d = 1 for ITN, fusion weight α = 0.2 and Gaussian scaling coefficient σ = 0.3 for FRF, noise ratio η = 0.2 (corresponding to σf = 0.8) for SNI, and FSI-Edit intervention durations of 50% and 65% for non-rigid and rigid editing tasks, respectively.
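The defaults quoted above can be gathered into a single configuration object, which makes a reproduction attempt easier to audit. This is only a sketch: the class name, field names, and the helper are hypothetical and do not mirror the FSI-Edit repository. Reading the intervention duration as a fraction of the 50 steps is an assumption, and the helper deliberately returns an unrounded value because the excerpt does not say how 65% of 50 steps is rounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FSIEditDefaults:
    """Default values quoted in the report; all field names are illustrative."""
    inversion_steps: int = 50            # DDIM steps for inversion
    inversion_cfg_scale: float = 1.0     # CFG scale during inversion
    target_cfg_scale: float = 7.5        # CFG scale of the target branch
    itn_pairing_distance: int = 1        # d for ITN
    frf_fusion_weight: float = 0.2       # alpha for FRF
    frf_gaussian_scale: float = 0.3      # sigma for FRF
    sni_noise_ratio: float = 0.2         # eta for SNI (reported with sigma_f = 0.8)
    intervention_non_rigid: float = 0.50
    intervention_rigid: float = 0.65

    def intervention_steps(self, rigid: bool) -> float:
        # Assumption: the duration is a fraction of the step count; no rounding applied.
        fraction = self.intervention_rigid if rigid else self.intervention_non_rigid
        return fraction * self.inversion_steps


defaults = FSIEditDefaults()
print(defaults.intervention_steps(rigid=False), defaults.intervention_steps(rigid=True))
```

Freezing the dataclass keeps the quoted defaults from being mutated by accident; an experiment that deviates from them would construct a new instance with explicit overrides.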