Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

OmniTry: Virtual Try-On Anything without Masks

Authors: Yutong Feng, Linlin Zhang, Hengyuan Cao, Yiming Chen, Xiaoduan Feng, Jian Cao, Yuxiong Wu, Bin Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods.
Researcher Affiliation	Collaboration	1 Kunbyte AI, 2 Zhejiang University
Pseudocode	No	The paper describes its two-staged pipeline and model architecture in sections 3.2 and 3.3, using descriptive text and figures like Figure 2, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The code, model weights, and evaluation benchmark of OmniTry are available at https://omnitry.github.io/.
Open Datasets	Yes	The code, model weights, and evaluation benchmark of OmniTry are available at https://omnitry.github.io/. We construct a comprehensive evaluation benchmark covering 12 common types of wearable objects, dubbed OmniTry-Bench... The benchmark predominantly sources images from public repositories (Pexels), supplemented with brand website materials and social media content under compliant data usage protocols.
Dataset Splits	Yes	For each sub-type, we collect 15 paired test images for man and woman, separately. ... Overall, the evaluation benchmark contains 360 pairs of images. For the first stage, we gather a diverse dataset containing both in-the-wild portrait images and in-shop model shots. ... The total amount of training pairs is 188,694. For the second stage, we collect paired samples following the 12 basic types in our benchmark. The whole dataset contains 51,195 pairs, which shows class-unbalanced distribution (14,861 pairs for clothes and 295 for ties).
Hardware Specification	Yes	All the experiments are conducted on 4 NVIDIA H800 GPUs.
Software Dependencies	No	The paper mentions using the AdamW optimizer [38], bfloat16 mixed precision, fine-tuning based on the distilled version of FLUX [32], and leveraging FlashAttention [12] and LoRA [21] for implementation. However, it does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup	Yes	We train the first stage with batch-size of 32 for 50K steps, and the second stage with batch-size of 16 for 25K steps. The location and identity adapters are implemented as LoRA [21] with rank 16. We employ the AdamW [38] optimizer with learning rate of 1e-4 and weight decay of 0.01. For both training of stage-1 and stage-2, we set the learning rate as 1e-4, gradient accumulation steps as 1, weight decay as 0.01 and gradient norm clipping as 1.0. We use the AdamW [38] optimizer with hyper-parameters β1 = 0.9 and β2 = 0.999. The model is trained with mixed precision of bfloat16. The guidance scale is fixed as 1 during training, and set as 30 during inference.
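
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which also gives a rough sense of how many passes over the training data each stage makes. This is an illustration only, not the authors' code: the `StageConfig` class and its field names are hypothetical, and the pass counts assume the quoted batch sizes are global (not per-GPU) and that every step consumes a full batch of distinct pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the per-stage settings quoted above."""
    name: str
    batch_size: int            # assumed to be the global batch size
    steps: int
    train_pairs: int           # training-set size from the Dataset Splits row
    grad_accum_steps: int = 1
    learning_rate: float = 1e-4
    weight_decay: float = 0.01
    grad_clip_norm: float = 1.0
    adam_betas: tuple = (0.9, 0.999)
    lora_rank: int = 16
    precision: str = "bfloat16"

    def samples_seen(self) -> int:
        # Total training pairs consumed over the whole stage.
        return self.batch_size * self.grad_accum_steps * self.steps

    def approx_passes(self) -> float:
        # Rough number of passes over the stage's training set.
        return self.samples_seen() / self.train_pairs


STAGE_1 = StageConfig("stage-1", batch_size=32, steps=50_000, train_pairs=188_694)
STAGE_2 = StageConfig("stage-2", batch_size=16, steps=25_000, train_pairs=51_195)

for cfg in (STAGE_1, STAGE_2):
    print(f"{cfg.name}: {cfg.samples_seen():,} samples, "
          f"~{cfg.approx_passes():.1f} passes over {cfg.train_pairs:,} pairs")
```

Under these assumptions, stage 1 sees 1,600,000 samples (roughly 8.5 passes) and stage 2 sees 400,000 samples (roughly 7.8 passes). If the batch sizes are per-GPU on the 4 H800 GPUs, these figures would be four times larger.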