Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Authors: Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, Yin Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance. Quantitative Evaluation. As shown in Table 1, our method consistently outperforms the baselines in MSE, CLIP-I, and FID metrics. Similarly, when compared with VisualCloze, our method achieves a notable improvement, reducing the MSE from 0.049 to 0.025, boosting CLIP-I from 0.802 to 0.894, and lowering FID from 7.218 to 4.801. These results demonstrate the effectiveness of our approach in producing both visually accurate and semantically meaningful image edits. Our method also consistently outperforms two state-of-the-art baselines in GPT-C and GPT-A metrics.
Researcher Affiliation Academia (1) Zhejiang University; (2) National University of Singapore
Pseudocode No The paper describes the methodology using textual explanations and mathematical formulations, along with architectural diagrams (Figure 2), but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Project page: https://github.com/gy8888/RelationAdapter. We provide full access to the Relation252K dataset and the codebase, including training scripts, evaluation pipeline, and detailed instructions for reproducing all experimental results. The links and setup instructions are included in the supplemental material.
Open Datasets Yes We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. The dataset contains 33,274 image pairs, which we further perform permutation to obtain a total of 251,580 training instances. We open-source the full dataset to encourage widespread usage and further research in this field.
Dataset Splits Yes We selected 2.6% of the dataset (6,540 samples) as a benchmark subset, covering a diverse range of 218 tasks. Among these, 6,240 samples correspond to tasks seen during training, while 300 represent unseen tasks used to evaluate the model's generalization capability.
Hardware Specification Yes Training spans 100,000 iterations on 4 H20 GPUs, with an accumulated batch size of 4. We use the AdamW optimizer and bfloat16 mixed-precision training, with an initial learning rate of 1 × 10⁻⁴. The total number of trainable parameters is 1,569.76 million. Training takes 48 hours and consumes 74 GB of GPU memory. At inference, the model requires 40 GB of GPU memory on a single H20 GPU.
Software Dependencies No The paper mentions several models and APIs used (e.g., FLUX.1-dev [2], SigLIP-SO400M-Patch14-384 [65], GPT-4o [35] multimodal API, LoRA [21]) and refers to the DiT architecture, but it does not provide specific version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup Yes We initialize our model with FLUX.1-dev [2] within the DiT architecture in training. To reduce computational overhead while retaining the pretrained model's generalization, we fine-tune the In-Context Editor using LoRA, with a rank of 128. Training spans 100,000 iterations on 4 H20 GPUs, with an accumulated batch size of 4. We use the AdamW optimizer and bfloat16 mixed-precision training, with an initial learning rate of 1 × 10⁻⁴. The attention fusion coefficient α is fixed to 1. To balance computational efficiency, input images are resized, maintaining their aspect ratio, such that the longer side does not exceed 512 pixels prior to encoding. During inference, we set the guidance_scale to 3.5, the number of denoising steps to 24, and the attention fusion weight α to 1.0. A fixed random seed of 1000 was used to ensure reproducibility.
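
For readers who want to sanity-check the reported setup, the values quoted in the table above can be collected in one place. The following is a minimal sketch, not the authors' code: the class name `ReportedSetup`, its field names, and the helper `resize_dims` are hypothetical, and the learning rate is read as 1e-4. The paper states only that the longer side does not exceed 512 pixels with the aspect ratio maintained; rounding to the nearest pixel and leaving smaller images unchanged are assumptions made here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the report; the field names are illustrative only."""
    lora_rank: int = 128
    train_iterations: int = 100_000
    num_gpus: int = 4
    accumulated_batch_size: int = 4
    learning_rate: float = 1e-4  # assumed reading of the reported rate
    max_side: int = 512
    guidance_scale: float = 3.5
    denoising_steps: int = 24
    attention_fusion_alpha: float = 1.0
    seed: int = 1000


def resize_dims(width: int, height: int, max_side: int = 512) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side.

    The aspect ratio is preserved. Images that already fit are returned
    unchanged, and results are rounded to the nearest pixel; both choices
    are assumptions, not details taken from the paper.
    """
    longer = max(width, height)
    if longer <= max_side:
        return width, height
    scale = max_side / longer
    return max(1, round(width * scale)), max(1, round(height * scale))


# Dataset figures quoted in the report.
TOTAL_INSTANCES = 251_580
BENCHMARK_SEEN = 6_240
BENCHMARK_UNSEEN = 300
BENCHMARK_TOTAL = BENCHMARK_SEEN + BENCHMARK_UNSEEN

# Share of the dataset held out as the benchmark subset, in percent.
benchmark_percent = 100 * BENCHMARK_TOTAL / TOTAL_INSTANCES

if __name__ == "__main__":
    setup = ReportedSetup()
    print(setup)
    print(resize_dims(1024, 768, setup.max_side))  # (512, 384)
    print(BENCHMARK_TOTAL, round(benchmark_percent, 1))
```

The split figures are internally consistent: the seen and unseen benchmark samples sum to the reported 6,540, which is about 2.6% of the 251,580 training instances.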