Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Representativeness-Aware Coreset Selection
Authors: Zihao Cheng, Binrui Wu, Zhiwei Li, Yuesen Liao, Su Zhao, Shuai Chen, Yuan Gao, Weizhong Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple datasets confirm the effectiveness of our approach. Notably, compared with existing gradient-based dynamic coreset selection baselines, our method achieves up to a 5.4% improvement in test accuracy across multiple datasets. We validate our method on multiple benchmark datasets, achieving higher test accuracy with negligible additional training cost. Section 5: Experiments |
| Researcher Affiliation | Collaboration | ¹Fudan University, ²Meituan Inc, ³Wuhan University, ⁴Shanghai Key Laboratory of Intelligent Information Processing |
| Pseudocode | Yes | Algorithm 1 Efficient Representativeness-Aware Coreset Selection (ERACS) |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes, we will make our experimental code available in the supplementary materials. |
| Open Datasets | Yes | We evaluate our proposed method, Efficient Representativeness-Aware Coreset Selection (ERACS), on three standard datasets: CIFAR-10, CIFAR-100, and ImageNet. The data used in our experiments is open source. |
| Dataset Splits | Yes | Models are trained for 300 epochs on CIFAR datasets and 350 epochs on ImageNet. We fix the coreset budget at 10% across all settings. Table 2: Test accuracy (%) under different data budgets using ResNet-18. |
| Hardware Specification | Yes | All experiments were run on Nvidia 4090 GPUs. |
| Software Dependencies | No | For training, we use stochastic gradient descent (SGD) with momentum 0.9, weight decay 5 × 10⁻⁴, and a cosine-annealed learning rate starting from 0.1. Models are trained for 300 epochs on CIFAR datasets and 350 epochs on ImageNet. We fix the coreset budget at 10% across all settings. |
| Experiment Setup | Yes | For training, we use stochastic gradient descent (SGD) with momentum 0.9, weight decay 5 × 10⁻⁴, and a cosine-annealed learning rate starting from 0.1. Models are trained for 300 epochs on CIFAR datasets and 350 epochs on ImageNet. We fix the coreset budget at 10% across all settings. To decide when to update the coreset, ERACS monitors the SNR of the gradient norms every C epochs and triggers reselection only when the SNR exceeds a threshold τ_snr. This mechanism ensures minimal overhead while preserving representativeness. We analyze τ_snr ∈ {0.1, 0.2, 0.3, 0.4, 0.5} to study its impact on model performance and efficiency. We evaluate C ∈ {1, 2, 3, 4, 5, 10} to determine optimal SNR monitoring frequency. |
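
The reselection trigger quoted in the Experiment Setup row (check the SNR of the gradient norms every C epochs, reselect only when it exceeds τ_snr) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not define how ERACS computes the SNR, so `gradient_norm_snr` below uses a placeholder definition (mean divided by population standard deviation) that is purely an assumption, and the names `should_reselect`, `check_every`, and `tau_snr` are hypothetical.

```python
import math
import statistics


def gradient_norm_snr(grad_norms):
    """Placeholder SNR of a list of gradient norms.

    ASSUMPTION: mean / population standard deviation. The quoted excerpt
    does not give the paper's SNR definition, so swap in the real one.
    """
    mean = statistics.fmean(grad_norms)
    std = statistics.pstdev(grad_norms)
    # A constant signal has no noise; treat its SNR as infinite.
    return math.inf if std == 0 else mean / std


def should_reselect(epoch, grad_norms, check_every=5, tau_snr=0.3, snr_fn=gradient_norm_snr):
    """Return True when the coreset should be reselected at this epoch.

    The SNR is evaluated only every `check_every` epochs (C in the excerpt),
    and reselection fires only when it exceeds `tau_snr`.
    """
    if epoch % check_every != 0:
        return False  # not a monitoring epoch, skip the SNR computation
    return snr_fn(grad_norms) > tau_snr
```

In a training loop, a call such as `should_reselect(epoch, norms, check_every=C, tau_snr=tau)` would gate the (comparatively expensive) coreset update, which is how the excerpt's claim of minimal overhead would arise. Passing a different `snr_fn` replaces the placeholder definition without touching the trigger logic.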