Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Diverse Condensed Data Generation via Class Preserving Distribution Matching

Authors: Dandan Guo, Zhuo Li, He Zhao, Mingyuan Zhou, Hongyuan Zha

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments reveal that our method can produce more effective condensed data for downstream tasks with less training cost and can also be successfully applied to de-biased dataset condensation."
Researcher Affiliation | Academia | Dandan Guo (School of Artificial Intelligence, Jilin University); Zhuo Li (Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen); He Zhao (CSIRO's Data61, Australia); Mingyuan Zhou (The University of Texas at Austin, USA); Hongyuan Zha (The Chinese University of Hong Kong, Shenzhen)
Pseudocode | Yes | "Algorithm 1 Workflow of our method. Require: training set T, synthetic set S randomly initialized from T with corresponding labels, off-the-shelf classifier g_ϕ, feature extractor f_θ, number of classes C, max-iter, and learning rate η. For t in max-iter do: randomly sample θ from P_θ and compute λ; randomly sample a minibatch B_c^T ⊂ T_c and S_c ⊂ S for each class c in C; compute L = Σ_{c=1}^{C} [ε(B_c^T(θ), S_c(θ)) - λ E_{s∼S_c} Pr(y | s; g_ϕ)]; update S ← S - η ∇_S L. End for. Output: a synthetic dataset S."
Open Source Code | Yes | "Code is available on https://github.com/BIRlz/TMLR_Dataset_Condensation"
Open Datasets | Yes | "For standard dataset condensation, we consider two widely adopted image datasets, CIFAR10 and CIFAR100. CIFAR10 (Krizhevsky et al., 2009) and CIFAR100 (Krizhevsky et al., 2009) consist of tiny colored natural images from 10 and 100 categories, respectively. Each dataset has 50K training images and 10K test images of size 32×32. Besides, following Yin et al. (2023), we also evaluate our proposed method on the large-scale Tiny ImageNet. Tiny ImageNet (Le & Yang, 2015) incorporates 200 classes derived from ImageNet-1K, with each class comprising 500 images at a resolution of 64×64. ... Here we adopt the CelebA face dataset (Liu et al., 2015), which includes 40 attributes."
Dataset Splits | Yes | "CIFAR10 (Krizhevsky et al., 2009) and CIFAR100 (Krizhevsky et al., 2009) consist of tiny colored natural images from 10 and 100 categories, respectively. Each dataset has 50K training images and 10K test images of size 32×32. ... Tiny ImageNet (Le & Yang, 2015) incorporates 200 classes derived from ImageNet-1K, with each class comprising 500 images at a resolution of 64×64. ... Here we adopt the CelebA face dataset (Liu et al., 2015), which includes 40 attributes. ... we resize the original images to 64×64 and keep the proportions of the 4 groups in the original dataset while randomly dropping some samples, resulting in 10,000 training examples with 85 in the smallest group (blond-haired males). ... We adopt the official test split to evaluate the methods. ... Table 13: Number of samples in different groups in the training and test datasets on CelebA. ... Test dataset: 180, 2480, 7536, 9767 (19,963 in total)."
Hardware Specification | Yes | "Table 3: Comparison of training speed (s/iter) and peak GPU memory (GB) on CIFAR-100 with a single NVIDIA A100 80G."
Software Dependencies | No | No specific software dependencies with version numbers are explicitly mentioned in the text. While deep learning frameworks like PyTorch are implied by the use of neural networks, their versions are not specified.
Experiment Setup | Yes | "Following previous work (Zhao & Bilen, 2023; Wang et al., 2025), we adopt the commonly-used DSA augmentation (Zhao & Bilen, 2021) and learn 1/10/50 image(s) per class (IPC) as synthetic sets for CIFAR-10/100 using the ConvNet architecture. ... We set max-iter = 40K and learn the model parameters ϕ using the original training set, which adopts the same architecture and setting as the downstream classifier in standard dataset condensation. ... We list the learning rate used for each dataset with varying IPCs in Table 10 and the mini-batch size of the real dataset in Table 11. We also report the pretraining details for the model parameters ϕ on different datasets in Table 12."
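The condensation loop quoted under Pseudocode can be illustrated with a minimal NumPy toy. This is a sketch under assumptions, not the authors' implementation: the feature extractor f_θ is replaced by a freshly sampled random linear map each iteration, the per-class distance ε is a mean-feature squared error, the classifier-confidence term weighted by λ is omitted, and all data, names, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def condense_step(syn, real, eta=0.1, dim_feat=16):
    """One distribution-matching update on the synthetic set.

    syn, real: dict mapping class -> (n, d) arrays.
    Samples a random linear feature extractor (a stand-in for f_theta),
    then moves each class's synthetic samples so their mean feature
    matches the mean feature of that class's real data.
    Returns the total matching loss before the update.
    """
    d = next(iter(syn.values())).shape[1]
    W = rng.standard_normal((dim_feat, d)) / np.sqrt(d)  # random f_theta
    loss = 0.0
    for c in syn:
        mu_real = (real[c] @ W.T).mean(axis=0)
        mu_syn = (syn[c] @ W.T).mean(axis=0)
        diff = mu_syn - mu_real
        loss += float(diff @ diff)
        # analytic gradient of ||mu_syn - mu_real||^2 w.r.t. syn[c]:
        # every row contributes 1/n to the mean, so the gradient is the
        # same vector (2/n) W^T diff broadcast over all rows.
        grad = (2.0 / len(syn[c])) * (diff @ W)
        syn[c] = syn[c] - eta * grad
    return loss

# Toy data: 2 classes of 8-dim points, real classes centered at 0 and 3.
real = {c: rng.standard_normal((64, 8)) + 3.0 * c for c in range(2)}
syn = {c: rng.standard_normal((4, 8)) for c in range(2)}

losses = []
for _ in range(200):
    losses.append(condense_step(syn, real))

assert losses[-1] < losses[0]  # matching loss shrinks over iterations
```

Because the gradient is shared across rows of a class, only the class mean of the synthetic set moves; the within-class spread inherited from the random initialization is preserved, which is one simple way diversity can survive mean matching.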