Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CSGO: Content-Style Composition in Text-to-Image Generation

Authors: Peng Xing, Haofan Wang, Yanpeng Sun, wangqixun, Baixu, Hao Ai, Jen-Yuan Huang, Zechao Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experiments; 5.1 Experimental Setup; 5.2 Experimental Results; 5.3 Ablation Studies; Table 1: Comparison with recent state-of-the-art methods on the test dataset. Table 2: User Preference Score. Table 4: Comparison of style transfer metrics across different methods.
Researcher Affiliation | Collaboration | (1) Nanjing University of Science and Technology; (2) InstantX Team; (3) Xiaohongshu Inc; (4) Beihang University; (5) Peking University
Pseudocode | Yes | Algorithm 1: Pipeline of Constructing CSSIT. Input: content images Set_content, style images Set_style. Output: content-style-stylized image triplets Set
Open Source Code | Yes | Source code: https://github.com/instantX-research/CSGO
Open Datasets | Yes | We employ the saliency detection datasets, MSRA10K [5, 6] and MSRAB [19], as the content images. In addition, for sketch stylized, we sample 1000 sketch images from ImageNet-Sketch [43] as content images. To ensure the richness of the style diversity, we sample 5000 images of different painting styles (history painting, portrait, genre painting, landscape, and still life) from the Wikiart dataset [33].
Dataset Splits | Yes | Based on the pipeline described in Section 3.1, as shown in Figure 2 (right), we construct a style transfer dataset, IMAGStyle, which contains 210K content-style-stylized image triplets as training dataset. Furthermore, we collect 248 content images from the web containing images of real scenes, sketched scenes, faces, and style scenes, as well as 206 style images of different scenes as testing dataset. For testing, each content image is transferred to 206 styles.
Hardware Specification | Yes | Our experiments are conducted on 8 NVIDIA H800 GPUs (80GB) with a batch size of 20 per GPU and trained 80000 steps.
Software Dependencies | No | For the CSGO framework, we employ stabilityai/stable-diffusion-xl-base-1.0 as the base model, pre-trained ViT-H as image encoder, and TTPlanet/TTPLanet_SDXL_Controlnet_Tile_Realistic as ControlNet. We uniformly set the images to 512 × 512 resolution.
Experiment Setup | Yes | For the IMAGStyle dataset, during the training phase, we suggest using a [vcp] as a prompt for content images and a [stp] as a prompt for style images. The rank is set to 64 and each B-LoRA is trained with 1000 steps. During the generation phase, we suggest using a [vcp] in [stv] style as the prompt. For the CSGO framework, we employ stabilityai/stable-diffusion-xl-base-1.0 as the base model, pre-trained ViT-H as image encoder, and TTPlanet/TTPLanet_SDXL_Controlnet_Tile_Realistic as ControlNet. We uniformly set the images to 512 × 512 resolution. The drop rate of text, content image, and style image is 0.15. The learning rate is 1e-4. During training stage, λc = λs = δc = 1.0. During inference stage, we suggest λc = λs = 1.0 and δc = 0.5. Our experiments are conducted on 8 NVIDIA H800 GPUs (80GB) with a batch size of 20 per GPU and trained 80000 steps.
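
As a rough aid for anyone attempting to reproduce the reported setup, the hyperparameters quoted above can be gathered into a single configuration object. The sketch below is illustrative only and is not taken from the CSGO repository: the class name, field names, and the helper `test_pairs` are hypothetical, and the derived global batch size assumes plain data parallelism with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CSGOTrainingSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""

    base_model: str = "stabilityai/stable-diffusion-xl-base-1.0"
    controlnet: str = "TTPlanet/TTPLanet_SDXL_Controlnet_Tile_Realistic"
    resolution: int = 512            # images are set to 512 x 512
    drop_rate: float = 0.15          # text, content image, and style image
    learning_rate: float = 1e-4
    num_gpus: int = 8                # NVIDIA H800 (80GB)
    batch_size_per_gpu: int = 20
    train_steps: int = 80_000

    @property
    def global_batch_size(self) -> int:
        # Assumption: data parallelism without gradient accumulation.
        return self.num_gpus * self.batch_size_per_gpu

    @property
    def samples_seen(self) -> int:
        # Training samples processed over the whole run under the same assumption.
        return self.global_batch_size * self.train_steps


def test_pairs(num_content: int = 248, num_styles: int = 206) -> int:
    """Content-style pairs in the test set: each content image meets every style."""
    return num_content * num_styles


setup = CSGOTrainingSetup()
print(setup.global_batch_size)  # 160
print(setup.samples_seen)       # 12800000
print(test_pairs())             # 51088
```

Under these assumptions the run processes 12.8 million samples, or roughly 61 passes over the approximately 210K training triplets, and the test protocol yields 51,088 content-style pairs.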