Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Color Conditional Generation with Sliced Wasserstein Guidance

Authors: Alexander Lobashev, Maria Larchenko, Dmitry Guskov

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental As a successor to Universal Diffusion Guidance [37], the proposed method is not tied to a specific architecture and can be paired with latent or pixel-space diffusion models. For our experiments we have selected Stable Diffusion 1.5 [45] and Stable Diffusion XL [46] (Dreamshaper-8 [47] and RealVisXL-V4 [48]) with the DDIM scheduler [43]. Test set: The experiments are conducted on images generated from the first 1000 prompts taken from the ContraStyles dataset [49]. Our color references are 1000 photos from Unsplash Lite [50]. We refer to these prompts and photos as the test set. A training set is not needed for our algorithm. Metrics: To measure stylization strength, we calculate the Wasserstein-2 distance between color distributions in RGB space. Two content-related metrics are based on CLIP embeddings [42]. CLIP-IQA [51] is a cosine similarity between a generated image and pre-selected anchor vectors that define good-looking pictures. CLIP-T [42] is a cosine similarity between CLIP representations of a text prompt and an image generated from this prompt. In other words, the CLIP-T score indicates whether a modified sampling process still follows the initial text prompt, while CLIP-IQA measures the overall quality of the pictures.
Researcher Affiliation Collaboration Alexander Lobashev (1), Maria Larchenko (2), Dmitry Guskov (1, 3). (1) Glam AI, San Francisco, USA; (2) Magicly AI, Dubai, UAE; (3) McGill University, Montreal, Canada. EMAIL
Pseudocode Yes Algorithm 1 Color Conditional Generation with Sliced Wasserstein Guidance
1: Initialize latent vector $x_T \sim \mathcal{N}(0, I)$; set learning rate $\lambda_{lr}$; $y$: samples from the reference color distribution
2: for $t = T$ to $1$ do
3:   $u \leftarrow 0$ (initialize control vector)
4:   for $j = 1$ to $M$ do
5:     $x'_t \leftarrow x_t + u$
6:     Get prediction of last latent $\hat{x}_0 \leftarrow \mathrm{DDIM}(t, x'_t)$
7:     Get $\hat{y}_0 \leftarrow \mathrm{VAE}(\hat{x}_0)$ (decode latent to image)
8:     for $k = 1$ to $K$ do (Sliced Wasserstein)
9:       Project samples on a random direction $\theta$
10:      Update loss $L \leftarrow L + \sum |\mathrm{cdf}_{\hat{y}_0} - \mathrm{cdf}_{y}|$
11:    end for
12:    Update control vector $u \leftarrow u - \lambda_{lr} \nabla_u L(u)$
13:  end for
14:  Update latent $x'_t \leftarrow x_t + u$
15:  Get denoised latent $x_{t-1} \leftarrow \mathrm{DDIM}(t, x'_t)$
16: end for
Open Source Code Yes Our source code is available at https://github.com/alobashev/sw-guidance.
Open Datasets Yes The experiments are conducted on images generated from the first 1000 prompts taken from the ContraStyles dataset [49]. Our color references are 1000 photos from Unsplash Lite [50]. We refer to these prompts and photos as the test set. A training set is not needed for our algorithm.
Dataset Splits Yes The experiments are conducted on images generated from the first 1000 prompts taken from the ContraStyles dataset [49]. Our color references are 1000 photos from Unsplash Lite [50]. We refer to these prompts and photos as the test set. A training set is not needed for our algorithm.
Hardware Specification Yes Hardware The experiments were conducted on a single workstation equipped with two Nvidia RTX 4090 GPU accelerators and 256 GB of RAM.
Software Dependencies No The paper mentions software such as the POT library, the piq Python library, and a CLIP model ID, but it does not give version numbers for POT or piq. For example, it does not state "POT 1.0" or "piq 0.5.1".
Experiment Setup Yes We fixed the CFG scale to 5 and the resolution to 768x768 for SDXL. For SD-1.5, the CFG scale was set to 8 and the resolution to 512x512. Both the SDXL and SD pipelines used the DDIM scheduler with 30 inference steps. Images for RB-Modulation were produced by Stable Cascade with a resolution of 1024x1024 and a total of 30 inference steps (20 for stage C and 10 for stage B). Method-specific settings are provided below. Baselines: For InstantStyle, the SDXL and SD-1.5 scales were set to 1.0. For IP-Adapter, the SDXL scale was set to 0.5 because higher scales tended to ignore the text prompt, producing variations of a reference image. The Colorcanny ControlNet for SD-1.5 had a conditioning scale of 1.0. For SW-Guidance, the SD-1.5 learning rate was lr = 0.04. In the SDXL version of SW-Guidance, we did not apply gradient normalization (line 23, Algorithm 2) and set the constant lr = 0.01 × 10^4 = 100.
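
The inner loop of Algorithm 1 (steps 8 to 11) accumulates differences between one-dimensional CDFs of projected color samples. The following is a minimal NumPy sketch of that idea only, not the authors' implementation: the function name, the fixed evaluation grid, the bin-width scaling, and the averaging over projections are all assumptions made for illustration. The sketch is also not differentiable, so it cannot supply the gradient used in step 12; the released code at the repository above should be consulted for the actual method.

```python
import numpy as np


def sliced_wasserstein_cdf_loss(generated, reference,
                                num_projections=64, num_bins=128, seed=0):
    """Approximate a sliced Wasserstein-1 distance between two RGB point clouds.

    generated, reference: arrays of shape (N, 3) and (M, 3).
    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    generated = np.asarray(generated, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    loss = 0.0
    for _ in range(num_projections):
        # Random unit direction theta in RGB space.
        theta = rng.normal(size=3)
        theta /= np.linalg.norm(theta)
        # Project both color clouds onto theta (1-D samples).
        proj_gen = generated @ theta
        proj_ref = reference @ theta
        # Shared evaluation grid covering both projections.
        lo = min(proj_gen.min(), proj_ref.min())
        hi = max(proj_gen.max(), proj_ref.max())
        grid = np.linspace(lo, hi, num_bins)
        # Empirical CDFs of each projection evaluated on the grid.
        cdf_gen = np.searchsorted(np.sort(proj_gen), grid, side="right") / proj_gen.size
        cdf_ref = np.searchsorted(np.sort(proj_ref), grid, side="right") / proj_ref.size
        # Sum of |CDF differences| times the bin width approximates 1-D W1.
        step = (hi - lo) / (num_bins - 1)
        loss += np.abs(cdf_gen - cdf_ref).sum() * step
    return loss / num_projections
```

Calling the function with two pixel arrays reshaped to (num_pixels, 3) returns a non-negative scalar that is zero when both arrays hold the same colors and grows as the color distributions move apart.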