Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
NEP: Autoregressive Image Editing via Next Editing Token Prediction
Authors: Huimin Wu, Xiaojian (Shawn) Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our framework on the image editing and text-to-image generation tasks. Firstly, we introduce the full training setup that trains the RLlamaGen and NEP stage-by-stage (§3.1). Secondly, we evaluate NEP for image editing and validate its design choices from various aspects (§3.2). Then, we demonstrate the results of NEP pre-training model RLlamaGen (§3.3). Finally, we showcase the test-time scaling behaviors (§3.4). |
| Researcher Affiliation | Academia | Huimin Wu¹, Xiaojian Ma¹, Haozhe Zhao², Yanpeng Zhao¹, Qing Li¹. ¹State Key Laboratory of General Artificial Intelligence, BIGAI; ²Peking University. Project website: nep-bigai.github.io |
| Pseudocode | No | The paper describes methods and processes in text and with diagrams (e.g., Figure 2: Overview of Next-Editing-token Prediction), but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The code will be released upon acceptance. |
| Open Datasets | Yes | Our training data consists of around 16M text-image pairs and is collected from multiple open-source datasets, including ALLaVA-LAION [5], CC12M [4], Kosmos-G [24], LAION-LVIS-220 [34], LAION-COCO-AESTHETIC [18], LAION-COCO-17M [56], and ShareGPT4V [6]. We train RLlamaGen for 60,000 steps with a batch size of 360 and an image resolution of 256×256. The optimizer is Fused AdamW with β1, β2 set to 0.9, 0.95, respectively, and a constant learning rate of 1e-4 is used. We perform training on 8 NVIDIA Tesla A100 GPUs, which takes 39 hours. Image Editing Training Settings. We fine-tune RLlamaGen for image editing by adding two learnable embeddings (i.e., Eemb and Uemb) to specify masking regions. This strategy is computationally efficient, with only 3.6k parameters introduced. Our editing model is trained on the UltraEdit dataset [60] that comprises 4 million image pairs, where 131k samples are annotated with editing regions. |
| Dataset Splits | Yes | The MagicBrush test set provides editing region annotations for each sample, thereby facilitating the evaluation of region-conditioned editing. This benchmark assesses both multi-turn editing, which evaluates the final image after a series of edits, and single-turn editing, which assesses the target image following an individual edit. The Emu Edit test set does not provide target images; therefore, the evaluation of editing region regeneration is conducted separately from the reconstruction of unedited regions. |
| Hardware Specification | Yes | We perform training on 8 NVIDIA Tesla A100 GPUs, which takes 39 hours. We perform training on 4 NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions building upon "LlamaGen [42]" and extracting embeddings from "FLAN-T5 [7]" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We train RLlamaGen for 60,000 steps with a batch size of 360 and an image resolution of 256×256. The optimizer is Fused AdamW with β1, β2 set to 0.9, 0.95, respectively, and a constant learning rate of 1e-4 is used. The model is trained for 3.9M steps with a batch size of 100 and a learning rate of 1e-4. Per common practices [58, 60], we evaluate models at a higher image resolution than that used during training (specifically, 512×512 pixels compared to 256×256 pixels), and fine-tune them on the target resolution for an additional 2,000 steps. For the Emu Edit benchmark, we train our model with a learning rate of 1e-5 for 60,000 steps. |
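
The pre-training figures quoted in the table can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the class and field names are hypothetical, the dataset size uses the rounded "around 16M" figure, and nothing here comes from the authors' code (which the table reports as not yet released).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStage:
    """One training stage as quoted in the table; field names are illustrative."""
    name: str
    steps: int
    batch_size: int
    learning_rate: float

    def samples_seen(self) -> int:
        # Each optimizer step consumes one batch of samples.
        return self.steps * self.batch_size


# Values copied from the "Experiment Setup" row; none of them are measured here.
pretrain = TrainingStage(
    name="RLlamaGen pre-training",
    steps=60_000,
    batch_size=360,
    learning_rate=1e-4,
)

DATASET_SIZE = 16_000_000  # "around 16M text-image pairs" (rounded in the source)
NUM_GPUS = 8               # "8 NVIDIA Tesla A100 GPUs"
WALL_CLOCK_HOURS = 39      # "which takes 39 hours"

total_samples = pretrain.samples_seen()
approx_epochs = total_samples / DATASET_SIZE
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS

print(f"samples seen: {total_samples:,}")
print(f"approx. passes over the data: {approx_epochs:.2f}")
print(f"GPU-hours: {gpu_hours}")
```

Under these assumptions, pre-training processes 21.6M samples, roughly 1.35 passes over the stated 16M pairs, at a cost of about 312 A100 GPU-hours.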