Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement

Authors: Haitian Zheng, Yuan Yao, Yongsheng Yu, Yuqian Zhou, Jiebo Luo, Zhe Lin

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on inpainting, object removal, and insertion benchmarks demonstrate that PixPerfect substantially enhances perceptual fidelity and downstream editing performance, establishing a new standard for robust and high-fidelity localized image editing. |
| Researcher Affiliation | Collaboration | Haitian Zheng¹, Yuan Yao², Yongsheng Yu², Yuqian Zhou¹, Jiebo Luo², Zhe Lin¹; ¹Adobe Research, ²University of Rochester; EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Code 1: A minimal demo script for reproducing the seam artifacts of Flux inpainting [3] model. |
| Open Source Code | No | The proposed method has been integrated into a commercial product and relies on proprietary code and datasets, so it will not be open-sourced at this time. However, we have provided detailed descriptions of the architecture, training pipeline, and evaluation settings in the main paper and supplementary material to support faithful reproduction of the main experimental results. |
| Open Datasets | Yes | We evaluate PixPerfect on three major tasks: inpainting, object removal, and object insertion. For inpainting, we follow prior works and use two standard datasets: Places2 [62] and MISATO [47]. ... For object removal, we use the RORDS dataset [39], which contains 500 image pairs with human-annotated foreground masks and corresponding clean background ground-truths. |
| Dataset Splits | Yes | For inpainting, we follow prior works and use two standard datasets: Places2 [62] and MISATO [47]. Places2 is a large-scale scene-centric dataset from which we randomly sample 2000 validation images and apply irregular masks of varying shapes and sizes to simulate occlusions. MISATO consists of 2000 512 × 512 images, each paired with a generated mask, specifically curated for evaluating semantic inpainting. For object removal, we use the RORDS dataset [39], which contains 500 image pairs with human-annotated foreground masks and corresponding clean background ground-truths. |
| Hardware Specification | Yes | Training is performed on a cluster of 32 NVIDIA A100 GPUs within one week. ... For example, when applied to a 512 × 512 image on a single NVIDIA A100 GPU, the diffusion sampling with FLUX-Fill [3] takes approximately 9.7 seconds, whereas our refiner adds only 2.7 seconds, accounting for only 21.8% of the total inference time. |
| Software Dependencies | No | The paper mentions software components like 'CMGAN architecture [61]', 'Adam', 'LPIPS [58]', `FluxFillPipeline.from_pretrained`, and `torch_dtype=torch.bfloat16`. However, it does not specify explicit version numbers for these software dependencies (e.g., PyTorch version, Diffusers library version, CUDA version). |
| Experiment Setup | Yes | Optimization uses Adam with a learning rate of 0.0005 and a batch size of 32. We interoperate larger perceptual and l1 weight, i.e. w1 = 64, w2 = 5, w3 = 1 to enforce color consistency. The perceptual loss is computed using LPIPS [58] following [12]. For the tone mapping function, the maximal polynomial degree is set to D = 5 to avoid overfitting. Training is performed on a cluster of 32 NVIDIA A100 GPUs within one week. ... The refiner is built on the CMGAN architecture [61]. ... R1 regularization with γ = 1 and utilizes the CoModGAN mask generation scheme [59] to generate random masks on-the-fly. During an initial warm-up phase, the discriminative pixel-space loss remains disabled. A constant learning rate of 5 × 10⁻⁴ is applied throughout the training. |
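
For readers attempting a reimplementation, the reported settings can be collected in one place. The sketch below is illustrative only: the class name `ReportedSetup`, its field names, and the helper `refiner_share` are hypothetical and do not come from the paper. The values are copied from the Hardware Specification and Experiment Setup rows above, and the mapping of `w1`, `w2`, `w3` to individual loss terms is deliberately left unspecified because the quoted text does not restate it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as quoted in the table above (field names are illustrative)."""
    optimizer: str = "Adam"
    learning_rate: float = 5e-4        # constant throughout training
    batch_size: int = 32
    w1: float = 64.0                   # loss weights as quoted; mapping to loss
    w2: float = 5.0                    # terms is not restated in the excerpt
    w3: float = 1.0
    r1_gamma: float = 1.0              # R1 regularization strength
    tone_mapping_max_degree: int = 5   # maximal polynomial degree D
    num_gpus: int = 32                 # NVIDIA A100, about one week of training


def refiner_share(diffusion_seconds: float, refiner_seconds: float) -> float:
    """Fraction of total inference time spent in the refiner stage."""
    return refiner_seconds / (diffusion_seconds + refiner_seconds)


setup = ReportedSetup()
share = refiner_share(diffusion_seconds=9.7, refiner_seconds=2.7)

print(f"lr={setup.learning_rate}, batch_size={setup.batch_size}")
print(f"refiner share of inference time: {share:.1%}")
```

The share computed from the quoted timings, 2.7 / (9.7 + 2.7), agrees with the 21.8% figure in the Hardware Specification row.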