Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Image Editing As Programs with Diffusion Models

Authors: Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments 5.1 Experimental Settings Training Settings. We train four specialized models for RoI inpainting, RoI editing, RoI compositing, and global transformation respectively. All models are fine-tuned on FLUX.1-dev [31] using LoRA [25], with default settings for rank 128 and alpha 128. Training is conducted with a batch size of 1 and runs for 50,000 iterations each. We use the Prodigy optimizer [39], enabling safeguard warmup and bias correction, with a weight decay of 0.01. The experiments are conducted on single NVIDIA H100 GPU (80GB). Dataset Setup. For both the RoI editing and global transformation models, we sample from the relevant subsets of the AnyEdit [73] dataset and apply GPT-4o [29] to filter the data of some types that have numerous noisy examples. To cover facial expression edits absent in AnyEdit, we integrate the CelebHQ-FM dataset [11], which offers consistent identities and annotated expressions suitable for our instruction schema. Evaluation Settings. We evaluate our method on two benchmarks: MagicBrush test set [76], a widely used dataset spanning diverse editing types, and AnyEdit test set [73], from which we select 16 instruction-based editing categories. For MagicBrush, we follow previous works [76, 81, 15, 56] and report CLIPimg, CLIPout [22], L1, and DINO [7, 43] scores to measure the similarity between the generated results and ground-truth images. While for AnyEdit, where some categories lack reference captions required for calculating CLIPout, we instead leverage GPT-4o [29] to rate each edited image on a scale from 1 to 5 across three dimensions: instruction faithfulness, semantic consistency, and aesthetic quality, with the final GPT score obtained by averaging the three aspect scores. We first compare our method with existing state-of-the-art open-source baselines, including InstructPix2Pix [6], MagicBrush [76], UltraEdit [81], GenArtist [64], OmniGen2 [66] and ICEdit [80]. In addition, to demonstrate the competitiveness of our approach against powerful proprietary multimodal foundation models in complex image editing scenarios, we further make comparisons with SeedEdit (Doubao) [57], Gemini 2.0 Flash [19], and GPT-4o [29]. 5.2 Comparisons with State of the Art. Qualitative Comparisons. Fig. 5 shows the results of our approach against other six methods [6, 76, 81, 64, 66, 80] on some representative editing cases. Unlike previous methods, which sometimes misinterpret or fail to execute the given instructions, modify unintended regions, introduce undesired artifacts, or produce visually implausible results, our method consistently exhibits clear and consistent advantages in accurately following the instruction, maintaining structural coherence, preserving instance-level fidelity and retaining fine-grained visual details. Quantitative Comparisons. Table 1 exhibits the quantitative comparison results of our method and other approaches [6, 76, 81, 64, 66, 80] on MagicBrush test set [76] and AnyEdit test set [73]. The results show that our method demonstrates state-of-the-art performance on both datasets. On 5.3 Ablation Studies Module-wise Ablation Studies. To quantify the impact of each key component in our framework, we perform a series of ablation studies on the AnyEdit [73] local semantic editing test set as we split in Sec. 5.2. As shown in Tab. 3, we first substitute our CoT reasoning and reduction pipeline with end-to-end editing pipeline, resulting in a marked performance deterioration across all metrics. Next, we replace our specialized RoI inpainting and RoI editing models respectively with the generic inpainting model from [60], which induces performance declines of varying degrees. We then remove the LLM-guided layout reconfiguration and instead employing random layout modifications for relevant operations, which incurs a noticeable performance decline. Finally, omitting the annular
Researcher Affiliation	Academia	Yujia Hu¹, Songhua Liu²,¹, Zhenxiong Tan¹, Xingyi Yang³,¹, and Xinchao Wang¹; ¹National University of Singapore, ²School of Artificial Intelligence, Shanghai Jiao Tong University, ³The Hong Kong Polytechnic University
Pseudocode	Yes	A Algorithm Illustration To better elaborate the details of the proposed IEAP, we provide an algorithmic illustration for the whole pipeline in Alg. 1. Algorithm 1 IEAP: Image Editing As Programs
Open Source Code	Yes	Codes are available here.
Open Datasets	Yes	Dataset Setup. For both the RoI editing and global transformation models, we sample from the relevant subsets of the AnyEdit [73] dataset and apply GPT-4o [29] to filter the data of some types that have numerous noisy examples. To cover facial expression edits absent in AnyEdit, we integrate the CelebHQ-FM dataset [11], which offers consistent identities and annotated expressions suitable for our instruction schema.
Dataset Splits	Yes	Evaluation Settings. We evaluate our method on two benchmarks: MagicBrush test set [76], a widely used dataset spanning diverse editing types, and AnyEdit test set [73], from which we select 16 instruction-based editing categories.
Hardware Specification	Yes	The experiments are conducted on single NVIDIA H100 GPU (80GB).
Software Dependencies	No	All models are fine-tuned on FLUX.1-dev [31] using LoRA [25], with default settings for rank 128 and alpha 128. Training is conducted with a batch size of 1 and runs for 50,000 iterations each. We use the Prodigy optimizer [39], enabling safeguard warmup and bias correction, with a weight decay of 0.01.
Experiment Setup	Yes	Training Settings. We train four specialized models for RoI inpainting, RoI editing, RoI compositing, and global transformation respectively. All models are fine-tuned on FLUX.1-dev [31] using LoRA [25], with default settings for rank 128 and alpha 128. Training is conducted with a batch size of 1 and runs for 50,000 iterations each. We use the Prodigy optimizer [39], enabling safeguard warmup and bias correction, with a weight decay of 0.01.
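
For readers who want the quoted setup in machine-readable form, the following is a minimal Python sketch, not the authors' code. It records the training values quoted in the table and implements the GPT score rule from the evaluation excerpt (three ratings on a 1 to 5 scale, averaged). The class name, field names, and function name are hypothetical and chosen for illustration only; the values come from the excerpts above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSettings:
    """Values quoted in the table; the field names are hypothetical."""
    base_model: str = "FLUX.1-dev"
    lora_rank: int = 128
    lora_alpha: int = 128
    batch_size: int = 1
    iterations: int = 50_000
    optimizer: str = "Prodigy"
    safeguard_warmup: bool = True
    bias_correction: bool = True
    weight_decay: float = 0.01
    hardware: str = "1x NVIDIA H100 (80GB)"


def gpt_score(faithfulness: float, consistency: float, aesthetics: float) -> float:
    """Average the three aspect ratings (each on a 1 to 5 scale)."""
    ratings = (faithfulness, consistency, aesthetics)
    for rating in ratings:
        # The excerpt states a 1-to-5 scale; reject anything outside it.
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)
```

For example, ratings of 5, 4, and 3 for instruction faithfulness, semantic consistency, and aesthetic quality average to a GPT score of 4.0. How the paper aggregates scores across images or categories is not stated in the excerpts, so the sketch stops at the per-image average.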