Data Valuation Without Training of a Model

Authors: Ki Nohyun, Hoyong Choi, Hye Won Chung

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding irregular or mislabeled data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics. Our code is publicly available at https://github.com/JJchy/CG_score. ... To evaluate the ability of the CG-score in identifying important examples, we design data pruning experiments, similar to those in Ghorbani & Zou (2019); Paul et al. (2021). We evaluate our score on three public datasets, FMNIST and CIFAR-10/100, and train ResNet networks (He et al., 2016): ResNet18 for FMNIST and CIFAR-10, and ResNet50 for CIFAR-100, respectively. (A hedged sketch of this pruning protocol appears after the table.)
Researcher Affiliation | Academia | Nohyun Ki, Hoyong Choi, Hye Won Chung; School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea; {kinohyun, chy0707, hwchung}@kaist.ac.kr
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Our code is publicly available at https://github.com/JJchy/CG_score.
Open Datasets | Yes | We evaluate our score on three public datasets, FMNIST and CIFAR-10/100... ImageNet (Deng et al., 2009) images.
Dataset Splits | Yes | We create a validation set composed of 1,000 samples by taking a part of the test dataset, and calculate the TracIn score with this validation set.
Hardware Specification | Yes | Table 4: GPU NVIDIA A100 40GB
Software Dependencies | No | The paper mentions software such as PyTorch and the timm module but does not provide specific version numbers for these or other key dependencies. It states, "PyTorch: An imperative style, high-performance deep learning library" and "using the timm module in PyTorch".
Experiment Setup | Yes | Table 3 (training details), FMNIST / CIFAR-10 / CIFAR-100: Architecture ResNet18 / ResNet18 / ResNet50; Batch size 128; Epochs 100 / 200 / 200; Initial learning rate 0.02 / 0.05 / 0.1; Weight decay 5e-4; Optimizer SGD with momentum 0.9; Learning rate scheduler: cosine annealing (Loshchilov & Hutter, 2017); Data augmentation: normalization by the dataset's mean and variance, random zero-padded cropping (4 pixels on all sides), random left-right flipping (probability 0.5). (A hedged PyTorch sketch of the CIFAR-10 configuration appears after the table.)
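The Research Type row quotes a data pruning protocol: rank training examples by a per-example value score (such as the CG-score), drop a fraction, retrain, and measure test accuracy. The sketch below is a minimal illustration of that loop, not code from the authors' repository; `pruning_curve`, `train_fn`, `eval_fn`, the `fractions` grid, and the pruning direction are all hypothetical names and assumptions.

```python
# Minimal sketch of a score-based data pruning experiment (assumed protocol).
# `scores` is any per-example value score, e.g. CG-score values; `train_fn`
# and `eval_fn` are caller-supplied placeholders, not functions from the
# authors' repository.
import numpy as np
from torch.utils.data import Subset

def pruning_curve(train_set, test_set, scores, train_fn, eval_fn,
                  fractions=(0.1, 0.2, 0.3, 0.4, 0.5), prune_high=True):
    """Retrain after pruning a fraction of examples ranked by `scores`."""
    order = np.argsort(scores)
    if prune_high:                        # prune the highest-scoring examples first
        order = order[::-1]
    accuracies = {}
    for frac in fractions:
        n_prune = int(len(scores) * frac)
        kept = np.sort(order[n_prune:])   # indices that survive pruning
        model = train_fn(Subset(train_set, kept.tolist()))
        accuracies[frac] = eval_fn(model, test_set)
    return accuracies
```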
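The Dataset Splits row states that a 1,000-sample validation set is taken from the test data for computing TracIn. A small sketch using torchvision's CIFAR-10 loader is given below; the paper excerpt does not specify how the 1,000 samples are chosen, so taking the first 1,000 test indices here is an assumption.

```python
# Sketch: carve a 1,000-sample validation set out of the CIFAR-10 test split
# (index choice is an assumption; the quoted text does not specify it).
from torch.utils.data import Subset
from torchvision import datasets, transforms

test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transforms.ToTensor())
val_set  = Subset(test_set, range(1000))                 # 1,000 validation samples
held_out = Subset(test_set, range(1000, len(test_set)))  # remaining test samples
```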
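The Experiment Setup row reports concrete hyperparameters from Table 3. The following is a hedged PyTorch sketch of the CIFAR-10 column (ResNet18, batch size 128, 200 epochs, initial learning rate 0.05, weight decay 5e-4, SGD with momentum 0.9, cosine annealing, and the listed augmentations). The normalization statistics, the DataLoader settings, and the use of torchvision's stock `resnet18` are assumptions beyond what the table states.

```python
# Hedged sketch of the CIFAR-10 training configuration reported in Table 3.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Commonly used CIFAR-10 channel statistics (assumed; Table 3 only says
# "normalize by dataset's mean and variance").
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # random zero-padded cropping, 4 px
    transforms.RandomHorizontalFlip(p=0.5),    # random left-right flipping
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = models.resnet18(num_classes=10)
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                           # cosine annealing per epoch
```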