Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Geometric Median (GM) Matching for Robust k-Subset Selection from Noisy Data

Authors: Anish Acharya, Sujay Sanghavi, Alex Dimakis, Inderjit S. Dhillon

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across image classification and image generation tasks demonstrate that GM MATCHING consistently outperforms existing pruning approaches, particularly in high-corruption settings, making it a strong baseline for robust data pruning. ... We conduct comprehensive experiments across a range of tasks including image classification, unsupervised distribution matching, and image generation. Our benchmarks cover diverse noise types: feature corruptions, label noise, and adversarial attacks.
Researcher Affiliation | Collaboration | 1 University of Texas at Austin, 2 Amazon, 3 University of California, Berkeley, 4 Bespoke Labs, 5 Google. Correspondence to: Anish Acharya <EMAIL>.
Pseudocode | Yes | Algorithm 1 GEOMETRIC MEDIAN MATCHING. Input: a finite collection of grossly corrupted (Definition 1) observations D = {x_i ∈ R^d}_{i=1}^n; pretrained encoder φ(·) : R^d → R^s, e.g. CLIP (Radford et al., 2021b); initial weight vector θ_0 ∈ R^s; number of sampling batches B; population fraction for GM computation 0 < γ_GM ≤ 1.
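The inputs above suggest a herding-style greedy loop that selects points whose embeddings collectively match the geometric median. The sketch below is an illustrative reconstruction, not the paper's reference implementation: the update rule `theta = theta + mu - emb[i]` and the initialization of `theta` at `mu` are assumptions, and the batching over B sampling batches is omitted.

```python
import numpy as np

def gm_matching(emb, mu, k):
    """Herding-style greedy selection toward a target point mu
    (intended as the geometric median of the embeddings).
    Illustrative sketch only; exact update and batching follow the paper.

    emb : (n, s) array of encoder embeddings phi(x_i)
    mu  : (s,) target point (geometric median of emb)
    k   : number of points to select
    """
    theta = mu.copy()              # assumed initialization theta_0 = mu
    chosen, avail = [], list(range(len(emb)))
    for _ in range(k):
        scores = emb[avail] @ theta          # alignment with current direction
        i = avail.pop(int(np.argmax(scores)))
        chosen.append(i)
        theta = theta + mu - emb[i]          # herding update toward mu
    return chosen
```

Because the target is the geometric median rather than the mean, a gross outlier barely shifts `mu`, so the greedy loop tends to avoid corrupted points.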
Open Source Code | Yes | Code is publicly available on GitHub.
Open Datasets | Yes | Our experiments span popular deep nets including ResNet-18/50, VGG-16, ShuffleNet, SENet, and EfficientNet-B0 across three popular image classification datasets: Tiny-ImageNet and CIFAR-10/100. ... Additionally, we conduct experiments on an unconditional image generation task using a diffusion model. Specifically, we train a U-Net with Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) on the MNIST dataset.
Dataset Splits | No | The paper does not explicitly state the training/validation/test split percentages or sample counts for the main datasets (CIFAR-100, Tiny-ImageNet, MNIST, ImageNet-1K) used in the experiments. While it refers to a 'selection ratio' for pruning (e.g., 'prune datasets at selection ratios ranging from 20% to 80%'), this is distinct from a dataset's intrinsic train/validation/test partitioning.
Hardware Specification | Yes | The results were generated using synthetic data drawn from a standard normal distribution, with the geometric median computed iteratively until convergence (tolerance ε = 10⁻⁵, maximum iterations = 100). For each combination of n and s, wall-clock time was averaged across 10 random seeds. The computational cost increases with both n and s: for fixed n, the scaling with s is approximately linear, while for fixed s, scaling with n exhibits sub-linear to near-linear growth. These results emphasize the trade-offs in selecting n and s for practical applications. Experiments were run on a single-threaded CPU setup.
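The paper reports only the tolerance (10⁻⁵) and iteration cap (100), not the solver. A standard choice matching that description is Weiszfeld's fixed-point iteration, sketched below under that assumption:

```python
import numpy as np

def geometric_median(X, eps=1e-5, max_iter=100):
    """Weiszfeld iteration for the geometric median of rows of X.
    Assumed solver; the paper states only tolerance 1e-5 and 100 max iterations.
    """
    mu = X.mean(axis=0)                       # initialize at coordinate-wise mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - mu, axis=1)
        d = np.maximum(d, 1e-12)              # guard against division by zero
        w = 1.0 / d                           # inverse-distance weights
        mu_next = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_next - mu) < eps:
            return mu_next
        mu = mu_next
    return mu
```

Each step is O(ns) (n points, dimension s), which is consistent with the reported near-linear scaling in both quantities.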
Software Dependencies | No | The paper mentions optimizers (SGD, AdamW) and various deep learning models (U-Net, ResNet, VGG, EfficientNet, CLIP), but it does not specify versions for core software libraries such as Python, PyTorch, or TensorFlow, nor for other ancillary software or solvers.
Experiment Setup | Yes | For the CIFAR-10/100 datasets, the training configuration consists of a batch size of 128, SGD optimizer with momentum (0.9), weight decay of 5×10⁻⁴, and an initial learning rate of 0.1. The learning rate undergoes step-wise decay by a factor of 5 at epochs 60, 120, and 160, over 200 total epochs. Data augmentation strategies incorporate random cropping and random horizontal flipping. For Tiny-ImageNet and ImageNet-1K experiments, we use a batch size of 256, SGD optimizer with momentum (0.9), weight decay of 1×10⁻⁴, and an initial learning rate of 0.1. The learning rate decreases by a factor of 10 at epochs 30 and 60, across 90 total epochs, employing random horizontal flipping for data augmentation. For training stability and optimal convergence, we adopt specific hyperparameter settings. The batch size is set to 128 to ensure efficient mini-batch updates. The learning rate is fixed at 1×10⁻⁴, tuned for stable convergence. We use the AdamW optimizer due to its adaptive learning rate properties and weight decay regularization. The number of diffusion time-steps is set to 1000, providing sufficient granularity for high-resolution generative refinement. A linear noise schedule is applied where β_t increases linearly over time-steps, preventing abrupt changes in noise levels.
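The CIFAR-10/100 recipe above maps directly onto a standard PyTorch optimizer/scheduler pair. A minimal sketch, assuming PyTorch: the model is a placeholder, and "decay by a factor of 5" corresponds to `gamma=0.2` in `MultiStepLR`.

```python
import torch

# Sketch of the reported CIFAR-10/100 recipe: SGD (momentum 0.9,
# weight decay 5e-4), initial lr 0.1, decayed 5x at epochs 60/120/160
# over 200 epochs. The linear model is a stand-in for e.g. ResNet-18.
model = torch.nn.Linear(3 * 32 * 32, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[60, 120, 160], gamma=0.2)  # gamma=1/5 = decay by 5x

for epoch in range(200):
    # ... one epoch of training at batch size 128, with random crop
    # and random horizontal flip augmentation ...
    sched.step()
```

The Tiny-ImageNet/ImageNet-1K recipe differs only in the values: batch size 256, weight decay 1e-4, `milestones=[30, 60]`, `gamma=0.1`, 90 epochs.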