Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Embedding is Worth a Thousand Noisy Labels

Authors: Francesco Di Salvo, Sebastian Doerrich, Ines Rieger, Christian Ledig

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive quantitative experiments, confirming that WANN has overall greater robustness compared to reference methods (ANN, fixed k-NN, robust loss functions) across diverse datasets and noise levels, including limited and severely imbalanced noisy data scenarios.
Researcher Affiliation Academia All four authors (Francesco Di Salvo, Sebastian Doerrich, Ines Rieger, Christian Ledig) are affiliated with xAILab Bamberg, University of Bamberg, Germany.
Pseudocode Yes The pseudocode is presented in Algorithm 1, "reliability(kmin, kmax, Dtrain)", which takes three parameters: kmin, kmax, and the training set Dtrain.
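As a rough illustration of what an adaptive-k reliability routine with the signature reliability(kmin, kmax, Dtrain) could look like, the sketch below scores each training point by the best label agreement among its k nearest neighbors over k in [kmin, kmax]. This is an assumption-laden toy version for intuition, not the paper's Algorithm 1.

```python
import numpy as np

def reliability(k_min, k_max, embeddings, labels):
    """Toy adaptive-k reliability sketch (NOT the paper's Algorithm 1):
    for each training point, search k in [k_min, k_max] and keep the
    highest fraction of k nearest neighbours sharing its label."""
    n = len(labels)
    # pairwise Euclidean distances between embeddings
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude each point from its own neighbours
    order = np.argsort(d, axis=1)      # neighbour indices sorted by distance
    scores = np.empty(n)
    for i in range(n):
        best = 0.0
        for k in range(k_min, min(k_max, n - 1) + 1):
            agree = np.mean(labels[order[i, :k]] == labels[i])
            best = max(best, agree)    # most favourable neighbourhood size
        scores[i] = best
    return scores
```

On clean, well-separated embeddings this assigns reliability 1.0 everywhere; points whose labels disagree with their neighborhood receive lower scores.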
Open Source Code Yes The code is available at github.com/francescodisalvo05/wann-noisy-labels.
Open Datasets Yes WANN's classification accuracy on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), utilizing (kmin, kmax) = (11, 51). We measure the accuracy on two publicly available datasets, namely CIFAR-10N and CIFAR-100N (Wei et al., 2022), which are noisy versions of the established CIFAR datasets. We further assess classification performance on four stratified subsets of Animal-10N (Song et al., 2019) consisting of {500, 1000, 2500, 5000} total samples. We used two datasets from the MedMNIST+ dataset collection (Yang et al., 2023), namely BreastMNIST and DermaMNIST. The experiments cover three different datasets: CIFAR-10, CIFAR-100, and MNIST (LeCun et al., 1998). As illustrated in Figure 2, we present four test samples along with their top-3 closest training examples from STL-10 (Coates et al., 2011), a subset of ImageNet (Deng et al., 2009).
Dataset Splits Yes excluding 15% for validation in linear training. To demonstrate this, two subsets are randomly sampled from CIFAR-10, each containing only 50 and 100 samples per class. We further assess classification performance on four stratified subsets of Animal-10N (Song et al., 2019) consisting of {500, 1000, 2500, 5000} total samples. For benchmarking on long-tailed problems, we use CIFAR-10LT and CIFAR-100LT, following prior studies (Cao et al., 2019; Du et al., 2023), with imbalance ratios of 1% and 10%, respectively. In CIFAR-10LT, the majority class has 5000 samples, while the minority class consists of 50 samples. Consequently, in CIFAR-100LT, we have 500 and 50 samples, respectively.
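The long-tailed splits quoted above (CIFAR-10LT: 5000 down to 50 samples; CIFAR-100LT: 500 down to 50) match the common exponential-decay CIFAR-LT convention. A minimal sketch of that convention, assumed here rather than quoted from the paper:

```python
import numpy as np

def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts under the common exponential long-tailed
    convention: class c keeps n_max * ratio^(c / (C - 1)) samples.
    Assumed convention for illustration, not the authors' exact code."""
    return [int(n_max * imbalance_ratio ** (c / (num_classes - 1)))
            for c in range(num_classes)]
```

With this convention, `long_tailed_counts(5000, 10, 0.01)` reproduces the 5000/50 majority/minority counts stated for CIFAR-10LT, and `long_tailed_counts(500, 100, 0.1)` the 500/50 counts for CIFAR-100LT.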
Hardware Specification No No specific hardware details (GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running the experiments are mentioned in the paper.
Software Dependencies No All pre-trained models are obtained from Hugging Face (timm). This phrase indicates software usage but lacks specific version numbers for any libraries or environments, making it insufficient for reproducible dependency information.
Experiment Setup Yes The linear probing utilizes the Adam optimizer with a learning rate of 1 × 10⁻⁴ for 100 epochs, with early stopping after 5 epochs. These settings were consistently applied to all experiments in this manuscript. Both adaptive methods, ANN and WANN, use (kmin, kmax) = (11, 51).
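The linear-probing recipe above (Adam, learning rate 1e-4, up to 100 epochs, early stopping with patience 5) can be sketched in plain NumPy as below. The softmax classifier, initialization, and full-batch updates are illustrative assumptions; only the hyperparameters come from the paper.

```python
import numpy as np

def linear_probe(X, y, X_val, y_val, lr=1e-4, epochs=100, patience=5, seed=0):
    """Minimal sketch of the reported linear-probing setup: a softmax
    classifier trained with Adam, stopping early after `patience`
    epochs without validation improvement. Illustrative only."""
    rng = np.random.default_rng(seed)
    n_cls = int(y.max()) + 1
    W = rng.normal(0, 0.01, (X.shape[1], n_cls))
    b = np.zeros(n_cls)
    # Adam moment estimates for W and b
    m = [np.zeros_like(W), np.zeros_like(b)]
    v = [np.zeros_like(W), np.zeros_like(b)]
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    best_acc, best, wait = -1.0, (W.copy(), b.copy()), 0
    for t in range(1, epochs + 1):
        logits = X @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(y)), y] -= 1          # softmax cross-entropy gradient
        grads = [X.T @ p / len(y), p.mean(axis=0)]
        for i, (param, g) in enumerate(zip((W, b), grads)):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            param -= lr * (m[i] / (1 - beta1 ** t)) / (
                np.sqrt(v[i] / (1 - beta2 ** t)) + eps)
        val_acc = np.mean((X_val @ W + b).argmax(axis=1) == y_val)
        if val_acc > best_acc:
            best_acc, best, wait = val_acc, (W.copy(), b.copy()), 0
        else:
            wait += 1
            if wait >= patience:              # early stopping on validation accuracy
                break
    return best, best_acc
```

On frozen embeddings this is all linear probing amounts to: only the final linear layer is trained, which is why the same small recipe transfers unchanged across all the datasets listed above.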