Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

The Promise of RL for Autoregressive Image Editing

Authors: Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan Rodriguez, Sai Rajeswar Mudumba, Siva Reddy, Chris Pal, Benno Krojer, Aishwarya Agrawal

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To this end, we conduct a series of experiments with three different learning paradigms (SFT, RL and chain-of-thought reasoning) and mixes of data. We empirically show that EARL performs well across many types of edits by evaluating on 6 diverse benchmarks in both IID and OOD settings. We achieve better results than prior state-of-the-art models on the OmniEdit [52], AURORA [26], and VisMin [2] benchmarks.
Researcher Affiliation Collaboration 1 Mila Quebec AI Institute; 2 Université de Montréal; 3 McGill University; 4 École de Technologie Supérieure (ETS); 5 Polytechnique Montréal; 6 ServiceNow; 7 Canada CIFAR AI Chair
Pseudocode Yes Algorithm 1: Group Relative Policy Optimization
    Input: initial policy model π_θinit; reward models r_ϕ; task prompts D; hyperparameters ϵ, β, µ
    Output: final policy model π_θ
    Function Group Relative Policy Optimization:
        π_θ ← π_θinit; π_ref ← π_θinit
        for step = 1 to M do
            Sample a batch D_b from D
            Update the old policy model π_θold ← π_θ
            Sample G outputs {o_i}, i = 1..G, from π_θold(· | q) for each question q ∈ D_b
            Compute rewards {r_i}, i = 1..G, for each sampled output o_i by running r_ϕ
            Compute Â_{i,t} for the t-th token of o_i through group relative advantage estimation
            for GRPO iteration = 1 to µ do
                Update the policy model π_θ by maximizing the GRPO objective
Open Source Code Yes We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
Open Datasets Yes Our dataset is divided into two categories based on the complexity of the edits: Simple Edits (S) and Complex Edits (C). Simple Edits (S): This category includes relatively simple local edits such as single-object and attribute changes, as well as global edits such as style and environment changes. These types of edits are common in large-scale synthetic datasets, such as OmniEdit [52] with 750K samples. Complex Edits (C): These edits involve more advanced operations, including counting, spatial, and action modifications, where current models often struggle. Datasets like Aurora-AG [26], Aurora-Kubric [26], VisMin [2], and Something-Something v2 [19] contain such challenging edits. Additionally, we use real-world edit requests curated with human-in-the-loop guidance, such as Human-Edit [3] and MagicBrush [61], which include complex object/attribute changes.
Dataset Splits No The paper describes dataset composition and sampling for training and evaluation subsets, but does not provide explicit train/validation/test splits with percentages or sample counts for its primary combined dataset. It mentions: "For RL post-training, we randomly sample from the respective data pool (S or C) at each iteration, using 16 unique samples per step with 8 rollouts per sample." and "For evaluation, we use a 1000-sample subset for I2EBench and Emu Edit."
Hardware Specification Yes For RL training, rewards were computed using VIEScore through a vLLM API server running Qwen2.5-VL-72B on 4 NVIDIA H100 GPUs. The training was conducted separately on a different server for 2000 steps, with early stopping based on reward plateaus, also using four NVIDIA H100 GPUs.
Software Dependencies No The paper mentions using specific optimizers (AdamW), techniques (DeepSpeed ZeRO stage 3, bfloat16), and models (Emu3, LLaMA-2, Qwen Tokenizer, SBER-MoVQGAN, vLLM) but does not provide specific version numbers for these software components or libraries.
Experiment Setup Yes For SFT, we initialize from BAAI/Emu3-Stage1 weights, set a learning rate of 1e-4, an effective batch size of 128 with 4 GPUs, a per-device batch size of 4, and 8 gradient accumulation steps. We use validation loss to stop training. For RL post-training, we use a KL divergence coefficient of 3e-4 and a learning rate of 3e-6. The RL model is trained with 8 rollouts per edit instruction and a batch size of 128 and training continues until the reward plateaus. To enhance training stability, we adopt a fully online policy gradient approach, performing a single gradient update at each RL step [25]. All images are resized to 256 × 256 by maintaining their original aspect ratio.
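
Illustrative note on the Pseudocode row: the "group relative advantage estimation" step in Algorithm 1 can be sketched as below. This is a minimal sketch, not the authors' code. It assumes the commonly used form in which each rollout's reward is normalized by the mean and standard deviation of its own group of G rollouts; the function name, the eps constant, the use of the population standard deviation, and the reward values are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts for the same prompt.

    Assumption: advantage = (r_i - group mean) / (group std + eps).
    eps is a hypothetical constant that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for G = 8 rollouts of a single edit instruction.
rewards = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value network is needed. In this sketch the same scalar would be shared by every token t of rollout o_i.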
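
Illustrative note on the Experiment Setup row: the reported batch sizes can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the table; the helper name is hypothetical, and reading the RL batch of 128 as 16 unique samples times 8 rollouts is an inference from the Dataset Splits row rather than something the paper states in those words.

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return num_gpus * per_device_batch * grad_accum_steps


# SFT: 4 GPUs, per-device batch size 4, 8 gradient accumulation steps.
sft_batch = effective_batch_size(num_gpus=4, per_device_batch=4, grad_accum_steps=8)

# RL: 16 unique samples per step, 8 rollouts per sample (assumed reading).
rl_batch = 16 * 8

print(sft_batch, rl_batch)
```

Both products come out to 128, which matches the effective batch size reported for SFT and the batch size reported for RL post-training.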