Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VTON-VLLM: Aligning Virtual Try-On Models with Human Preferences
Authors: Siqi Wan, Jingwen Chen, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To alleviate this issue, we curate a dataset of synthesized VTON images annotated with human judgments across multiple perceptual criteria. A vision large language model (VLLM), namely VTON-VLLM, is then learnt on these annotations. VTON-VLLM functions as a unified fashion expert and is capable of both evaluating and steering VTON synthesis towards human preferences. Technically, beyond serving as an automatic VTON evaluator, VTON-VLLM upgrades VTON model through two pivotal ways: (1) providing fine-grained supervisory signals during the training of a plug-and-play VTON refinement model, and (2) enabling adaptive and preference-aware test-time scaling at inference. To benchmark VTON models more holistically, we introduce VITON-Bench, a challenging test suite of complex try-on scenarios, and human-preference aware metrics. Extensive experiments demonstrate that powering VTON models with our VTON-VLLM markedly enhances alignment with human preferences. |
| Researcher Affiliation | Collaboration | Siqi Wan¹, Jingwen Chen², Qi Cai², Yingwei Pan², Ting Yao², Tao Mei²; ¹University of Science and Technology of China, ²HiDream.ai; EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and formulas but does not contain any explicitly labeled pseudocode or algorithm blocks. For example, it presents the objective function $\mathcal{L}_{DiT} = \mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0, I)} \lVert v_\gamma([x_t, x_{cond}, x_{mask}], t_r) - (\epsilon - x_0) \rVert_2^2$ (Eq. 1), but this is a mathematical formula, not pseudocode. |
| Open Source Code | Yes | Code is publicly available at: https://github.com/HiDream-ai/VTON-VLLM/. |
| Open Datasets | Yes | Datasets. VITON-HD [8] contains 13,679 frontal-view image pairs of women and upper garments. Following previous works [14, 29], we split the dataset into 11,647 training pairs and 2,032 testing pairs. Dress Code [30] consists of 53,795 image pairs, divided into three categories: 15,366 upper-body clothes, 8,951 lower-body clothes, and 29,478 dresses. We adopt the official split, using 1,800 pairs from each category for testing and the remaining pairs for training. ... All the methods are evaluated on three test sets: VITON-HD, Dress Code, and our VITON-Bench. For the paired setting, SSIM [43] and LPIPS [50] are commonly adopted to measure visual similarity between the generated images and the ground-truth images. Furthermore, we leverage our proposed VTON-VLLM as a fashion expert to compute human-preference-aware metrics, which evaluate both garment consistency (GC) and image quality (IQ) across a set of fine-grained attributes, including visual patterns, text characters, sleeve style, garment shape, edge artifacts, and human pose. For the unpaired setting where ground-truth references are unavailable, we employ FID [18], KID [3], GC and IQ to assess generation quality. |
| Dataset Splits | Yes | VITON-HD [8] contains 13,679 frontal-view image pairs of women and upper garments. Following previous works [14, 29], we split the dataset into 11,647 training pairs and 2,032 testing pairs. Dress Code [30] consists of 53,795 image pairs, divided into three categories: 15,366 upper-body clothes, 8,951 lower-body clothes, and 29,478 dresses. We adopt the official split, using 1,800 pairs from each category for testing and the remaining pairs for training. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models. Section 5.1 'Implementation Details' describes training parameters but not the underlying hardware. |
| Software Dependencies | Yes | Our VTON-VLLM is initialized from Pixtral-12B [1] and further fine-tuned on the collected human feedback dataset. ... we propose incorporating the intrinsic in-context visual priors of a pre-trained text-to-image diffusion transformer (i.e., FLUX-Fill²) into our VRM-Instruct, which frames VTON synthesis as conditional image inpainting. ... ² https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev |
| Experiment Setup | Yes | Implementation Details. (1) VTON-VLLM: Our VTON-VLLM is initialized from Pixtral-12B [1] and further fine-tuned on the collected human feedback dataset. The model is trained for 2 epochs with a batch size of 64. The learning rate is set to 0.0001, and AdamW [27] is employed as the optimizer. We incorporate Low-Rank Adaptation (LoRA) [20] with a rank of 16 for training efficiency. (2) VTON Refinement Model: We also employ AdamW to optimize the model over 50,000 training steps. The learning rate is set to 0.00005 with a warmup over 500 iterations, and the batch size is 1. |
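
The split sizes quoted in the Dataset Splits row can be cross-checked with simple arithmetic. The sketch below is illustrative only: the dictionary and function names are hypothetical and do not come from the paper or its repository, and the per-category Dress Code training counts are derived here on the assumption that exactly 1,800 pairs per category are held out for testing, as the quoted text states.

```python
# Split sizes copied from the quoted paper text; all names here are hypothetical.
VITON_HD = {"total": 13_679, "train": 11_647, "test": 2_032}
DRESS_CODE = {"upper-body": 15_366, "lower-body": 8_951, "dresses": 29_478}
DRESS_CODE_TOTAL = 53_795
TEST_PER_CATEGORY = 1_800  # official split: 1,800 test pairs per category


def dress_code_split(categories, test_per_category):
    """Return {category: (train_pairs, test_pairs)} for a fixed per-category test size."""
    return {
        name: (count - test_per_category, test_per_category)
        for name, count in categories.items()
    }


def check_splits():
    """Verify the reported totals and return the derived Dress Code counts."""
    # VITON-HD: training and testing pairs should add up to the reported total.
    assert VITON_HD["train"] + VITON_HD["test"] == VITON_HD["total"]
    # Dress Code: the three categories should add up to the reported total.
    assert sum(DRESS_CODE.values()) == DRESS_CODE_TOTAL

    split = dress_code_split(DRESS_CODE, TEST_PER_CATEGORY)
    train_total = sum(train for train, _ in split.values())
    test_total = sum(test for _, test in split.values())
    return split, train_total, test_total


if __name__ == "__main__":
    split, train_total, test_total = check_splits()
    for name, (train, test) in split.items():
        print(f"{name}: {train} train / {test} test")
    print(f"Dress Code overall: {train_total} train / {test_total} test")
```

Under these assumptions the reported totals are internally consistent, and the derived Dress Code split is 48,395 training pairs and 5,400 testing pairs.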