Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation

Authors: Jiyu Guo, Shuo Yang, Yiming Huang, Yancheng Long, Xiaobo Xia, Xiu Su, Bo Zhao, Zeke Xie, Liqiang Nie

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UTILGEN consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UTILGEN produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
Researcher Affiliation | Academia | (1) Harbin Institute of Technology, Shenzhen; (2) National University of Singapore; (3) MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; (4) Central South University; (5) Shanghai Jiao Tong University; (6) Hong Kong University of Science and Technology (Guangzhou). {EMAIL, EMAIL}
Pseudocode | Yes | Algorithm 1: Task-Oriented Data Valuation (TODV). Input: Training data D_r D_g = {(x_i, y_i)}_{i=1}^{N}; validation data D_v = {(x_j, y_j)}_{j=1}^{M}. Required: Classifier parameters θ; weight network parameters ϕ; max iteration T. ... Algorithm 2: Instance-Level Generation Policy Optimization (ILPO) ... Algorithm 3: Model-Level Generation Capability Optimization (MLCO)
Open Source Code | Yes | Our study uses exclusively publicly accessible datasets and includes the complete implementation source code in the supplementary materials (provided as a ZIP archive).
Open Datasets | Yes | We evaluate the effectiveness of UTILGEN across eight datasets spanning three classification tasks: coarse-grained classification (ImageNet-1k-Subset [73], ImageNet-100-Subset [73], and Caltech 101 [74]), fine-grained classification (Oxford Pets [75], Food-S [76], Flowers 102 [68], and CUB-200-2011 [77]), and texture classification (DTD [78]).
Dataset Splits | Yes | ImageNet-1k-Subset [73] and ImageNet-100-Subset [73], both randomly sampled with 100 images per class. ... Food101-Subset [76] being a curated subset of Food101 [76] containing 101 food categories. Other datasets follow their original training and validation setups. ... Detailed dataset statistics are provided in Table 16, summarizing the number of classes, training samples, and test samples for each dataset.
Hardware Specification | Yes | In contrast, generating the same amount of data using UTILGEN requires only about 0.94 hours and $20 on 8 V100 GPUs rented from Google Cloud.
Software Dependencies | No | The paper mentions specific software components like 'Stable Diffusion v2.1 [4]' and 'DeepSeek-R1-Distill-Qwen-1.5B [79]', but it does not provide version numbers for general programming languages or common libraries such as Python, PyTorch, or TensorFlow, which are necessary for full reproducibility.
Experiment Setup | Yes | The training settings for both the classifier and the weight network, including optimizers, learning rates, and batch sizes, are summarized in Table 8. ... The key hyperparameters used in our DPO training process are summarized in Table 9. ... The specific hyperparameters used for this optimization process are summarized in Table 11. ... The image generation process is guided by the hyperparameters listed in Table 12. ... The training configuration uses SGD optimizer with momentum 0.9 and weight decay 5e-4. The learning rate starts at 0.01 with cosine decay schedule over 100 epochs. A fixed batch size of 256 is used for all experiments, with standard data augmentation including random horizontal flips and crops.
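
The classifier training configuration quoted in the Experiment Setup entry can be illustrated with a small dependency-free sketch. The numeric values (learning rate 0.01, 100 epochs, momentum 0.9, weight decay 5e-4, batch size 256) come from the quoted text. Everything else is an assumption made for illustration: the function names `cosine_lr` and `sgd_momentum_step` are hypothetical, the schedule is assumed to update once per epoch with no warmup and to decay to zero, and weight decay is assumed to be added to the gradient as an L2 term. The paper's released code may differ on each of these points.

```python
import math

# Hyperparameters quoted in the Experiment Setup entry.
BASE_LR = 0.01
EPOCHS = 100
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
BATCH_SIZE = 256


def cosine_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=0.0):
    """Cosine-decayed learning rate at a given 0-indexed epoch.

    Assumptions (not stated in the source): no warmup, one update per
    epoch, and decay towards min_lr = 0.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


def sgd_momentum_step(w, grad, velocity, lr,
                      momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One SGD-with-momentum update for a scalar parameter.

    Assumption: weight decay is applied as an L2 term added to the
    gradient before the momentum buffer is updated.
    """
    g = grad + weight_decay * w          # L2-regularised gradient
    velocity = momentum * velocity + g   # momentum buffer update
    return w - lr * velocity, velocity


if __name__ == "__main__":
    # Print the schedule at a few checkpoints.
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, cosine_lr(epoch))
```

Under these assumptions the learning rate falls to half its initial value at epoch 50 and reaches zero at epoch 100. Swapping in a nonzero `min_lr` or a warmup phase would change the curve, so the actual schedule should be confirmed against the supplementary code.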