Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Label Noise-Robust Learning using a Confidence-Based Sieving Strategy
Authors: Reihaneh Torkzadehmahani, Reza Nasirigerdeh, Daniel Rueckert, Georgios Kaissis
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, we experimentally illustrate the superior performance of our proposed approach compared to recent studies on various settings, such as synthetic and real-world label noise. Moreover, we show CONFES can be combined with other state-of-the-art approaches, such as Co-teaching and DivideMix to further improve model performance. |
| Researcher Affiliation | Academia | Reihaneh Torkzadehmahani EMAIL Technical University of Munich Reza Nasirigerdeh EMAIL Technical University of Munich Helmholtz Munich Daniel Rueckert EMAIL Technical University of Munich Imperial College London Georgios Kaissis EMAIL Technical University of Munich Helmholtz Munich |
| Pseudocode | Yes | Algorithm 1: Confidence error based sieving (CONFES) Algorithm 2: Instance-dependent Label Noise Generation taken from Xia et al. (2020) |
| Open Source Code | Yes | The code is available at: https://github.com/reihaneh-torkzadehmahani/confes |
| Open Datasets | Yes | We utilize the CIFAR-10/100 datasets (Krizhevsky et al., 2009) and make them noisy using different types of synthetic label noise. Furthermore, we incorporate the Clothing1M dataset (Xiao et al., 2015), a naturally noisy benchmark dataset widely employed in previous studies. |
| Dataset Splits | Yes | CIFAR-10/100 contain 50000 training samples and 10000 testing samples of shape 32×32 from 10/100 classes. For the CIFAR datasets, we perturb the training labels using symmetric, pairflip, and instance-dependent label noise introduced in Xia et al. (2020), but keep the test set clean. ... Clothing1M is a real-world dataset of 1 million images of size 224×224 with noisy labels (whose estimated noise level is approximately 38% (Wei et al., 2022; Song et al., 2019)) and 10k clean test images from 14 classes. |
| Hardware Specification | Yes | We conduct the experiments on a single GPU system equipped with an NVIDIA RTX A6000 graphic processor and 48GB of GPU memory. |
| Software Dependencies | Yes | Our method is implemented in PyTorch v1.9. |
| Experiment Setup | Yes | For all methods, we evaluate the average test accuracy on the last five epochs, and for co-teaching, we report the average of this metric for the two networks. Following previous works (Li et al., 2020; Bai et al., 2021), we train the PreActResNet-18 (He et al., 2016) model on CIFAR-10 and CIFAR-100 using the SGD optimizer with momentum of 0.9, weight decay of 5e-4, and batch size of 128. The initial learning rate is set to 0.02, which is decreased by 0.01 in 300 epochs using a cosine annealing scheduler (Loshchilov & Hutter, 2017). For the Clothing1M dataset, we adopt the setting from Li et al. (2020) and train the ResNet-50 model for 80 epochs. The optimizer is SGD with momentum of 0.9 and weight decay of 1e-3. The initial learning rate is 0.002, which is reduced by a factor of 10 at epoch 40. At each epoch, the model is trained on 1000 mini-batches of size 32. |
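The cosine annealing schedule cited in the CIFAR setup above can be sketched in pure Python. This is a hedged illustration of the schedule formula from Loshchilov & Hutter (2017), not code from the paper; the function name and the zero minimum learning rate are assumptions.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, lr_max=0.02, lr_min=0.0):
    """Cosine annealing (Loshchilov & Hutter, 2017):
    lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)).
    lr_max=0.02 and total_epochs=300 match the reported CIFAR setting;
    lr_min=0.0 is an assumed floor."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealing_lr(0))    # start of training: 0.02
print(cosine_annealing_lr(150))  # midpoint: 0.01
print(cosine_annealing_lr(300))  # end of training: 0.0 (up to float rounding)
```

In PyTorch this schedule corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)` wrapped around the SGD optimizer described above.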