Data Valuation Without Training of a Model
Authors: Ki Nohyun, Hoyong Choi, Hye Won Chung
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding irregular or mislabeled data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics. Our code is publicly available at https://github.com/JJchy/CG_score. ... To evaluate the ability of the CG-score in identifying important examples, we design data pruning experiments, similar to those in Ghorbani & Zou (2019); Paul et al. (2021). We evaluate our score on three public datasets, FMNIST, CIFAR-10/100 and train ResNet networks (He et al., 2016), ResNet18 for FMNIST and CIFAR-10 and ResNet50 for CIFAR-100 dataset, respectively. (A minimal data-pruning sketch is given below the table.) |
| Researcher Affiliation | Academia | Nohyun Ki, Hoyong Choi, Hye Won Chung; School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea; {kinohyun, chy0707, hwchung}@kaist.ac.kr |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/JJchy/CG_score. |
| Open Datasets | Yes | We evaluate our score on three public datasets, FMNIST, CIFAR-10/100... ImageNet (Deng et al., 2009) images. |
| Dataset Splits | Yes | We create a validation set composed of 1,000 samples by taking a part of the test dataset, and calculate TracIn score with this validation set. |
| Hardware Specification | Yes | Table 4: GPU: NVIDIA A100 40GB |
| Software Dependencies | No | The paper mentions software such as PyTorch and the timm module but does not provide specific version numbers for these or other key dependencies. It states, "PyTorch: An imperative style, high-performance deep learning library" and "using the timm module in PyTorch". |
| Experiment Setup | Yes | Table 3: Details for the experiments used in the training of the datasets. FMNIST / CIFAR-10 / CIFAR-100: Architecture ResNet18 / ResNet18 / ResNet50; Batch size 128; Epochs 100 / 200 / 200; Initial Learning Rate 0.02 / 0.05 / 0.1; Weight decay 5e-4; Optimizer SGD with momentum 0.9; Learning Rate Scheduler: cosine annealing schedule (Loshchilov & Hutter, 2017); Data Augmentation: normalize by dataset's mean and variance; random zero-padded cropping (4 pixels on all sides); random left-right flipping (probability 0.5). (A training-recipe sketch follows the table.) |
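The data-pruning protocol quoted under Research Type amounts to ranking training examples by a precomputed per-example score and keeping only the highest-scoring fraction before training. The snippet below is a minimal PyTorch sketch of that idea; the score file `cg_scores.npy` and the helper `prune_by_score` are hypothetical names used for illustration, not taken from the authors' repository at https://github.com/JJchy/CG_score.

```python
# Minimal sketch: prune a training set by a precomputed per-example score
# (e.g. a CG-score), keeping only the top fraction before training.
import numpy as np
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

def prune_by_score(dataset, scores, keep_fraction=0.7):
    """Keep the `keep_fraction` of examples with the highest scores."""
    scores = np.asarray(scores)
    n_keep = int(len(dataset) * keep_fraction)
    keep_idx = np.argsort(scores)[::-1][:n_keep]  # indices sorted by descending score
    return Subset(dataset, keep_idx.tolist())

train_set = datasets.CIFAR10("./data", train=True, download=True,
                             transform=transforms.ToTensor())

# Assumed to hold one precomputed score per training example (hypothetical file).
cg_scores = np.load("cg_scores.npy")
pruned_set = prune_by_score(train_set, cg_scores, keep_fraction=0.7)
train_loader = DataLoader(pruned_set, batch_size=128, shuffle=True, num_workers=4)
```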
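The Experiment Setup row lists the CIFAR-10 recipe (ResNet18, batch size 128, 200 epochs, initial learning rate 0.05, weight decay 5e-4, SGD with momentum 0.9, cosine annealing, crop/flip augmentation). The sketch below wires those hyperparameters together in PyTorch. The normalization statistics and the use of torchvision's ImageNet-style `resnet18` are assumptions, since the quoted text does not specify the exact normalization constants or ResNet variant.

```python
# A minimal CIFAR-10 training sketch following the quoted hyperparameters.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Commonly used CIFAR-10 channel statistics (assumption, not copied from the paper).
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # random zero-padded cropping (4 px)
    transforms.RandomHorizontalFlip(p=0.5),     # random left-right flipping
    transforms.ToTensor(),
    transforms.Normalize(mean, std),            # normalize by dataset mean/variance
])

train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

model = models.resnet18(num_classes=10)         # stand-in for the paper's ResNet18
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                            # cosine annealing per epoch
```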