Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
An Embedding is Worth a Thousand Noisy Labels
Authors: Francesco Di Salvo, Sebastian Doerrich, Ines Rieger, Christian Ledig
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive quantitative experiments confirm that WANN has overall greater robustness than reference methods (ANN, fixed k-NN, robust loss functions) across diverse datasets and noise levels, including limited and severely imbalanced noisy data scenarios. |
| Researcher Affiliation | Academia | All four authors (Francesco Di Salvo, Sebastian Doerrich, Ines Rieger, Christian Ledig) are affiliated with xAILab Bamberg, University of Bamberg, Germany. |
| Pseudocode | Yes | The pseudocode is presented in Algorithm 1, reliability(kmin, kmax, Dtrain), which requires three parameters: kmin, kmax, and Dtrain. |
| Open Source Code | Yes | The code is available at github.com/francescodisalvo05/wann-noisy-labels. |
| Open Datasets | Yes | WANN's classification accuracy is evaluated on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), utilizing (kmin, kmax) = (11, 51). We measure the accuracy on two publicly available datasets, namely CIFAR-10N and CIFAR-100N (Wei et al., 2022), which are noisy versions of the established CIFAR datasets. We further assess classification performance on four stratified subsets of Animal-10N (Song et al., 2019) consisting of {500, 1000, 2500, 5000} total samples. We used two datasets from the MedMNIST+ dataset collection (Yang et al., 2023), namely BreastMNIST and DermaMNIST. The experiments cover three different datasets: CIFAR-10, CIFAR-100, and MNIST (Lecun et al., 1998). As illustrated in Figure 2, we present four test samples along with their top 3 closest training examples from STL-10 (Coates et al., 2011), a subset of ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | 15% of the data is excluded for validation in linear training. To demonstrate this, two subsets are randomly sampled from CIFAR-10, each containing only 50 and 100 samples per class. We further assess classification performance on four stratified subsets of Animal-10N (Song et al., 2019) consisting of {500, 1000, 2500, 5000} total samples. For benchmarking on long-tailed problems, we use CIFAR-10LT and CIFAR-100LT, following prior studies (Cao et al., 2019; Du et al., 2023), with imbalance ratios of 1% and 10%, respectively. In CIFAR-10LT, the majority class has 5000 samples, while the minority class consists of 50 samples. In CIFAR-100LT, the corresponding counts are 500 and 50 samples. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | All pre-trained models are obtained from Hugging Face (timm). This phrase indicates software usage but lacks specific version numbers for any libraries or environments, making it insufficient for reproducible dependency information. |
| Experiment Setup | Yes | The linear probing utilizes the Adam optimizer with a learning rate of 1 × 10⁻⁴ for 100 epochs, with early stopping after 5 epochs. These settings were consistently applied to all experiments in this manuscript. As discussed in the method section, both adaptive methods (ANN and WANN) used (kmin, kmax) = (11, 51). |
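The adaptive-neighborhood idea behind Algorithm 1, reliability(kmin, kmax, Dtrain), can be sketched as follows. This is an illustrative interpretation, not the paper's exact procedure: here, a sample's reliability is assumed to be its best label-agreement rate among its k nearest neighbors in embedding space, scanned over neighborhood sizes from kmin to kmax, and predictions then use a reliability-weighted k-NN vote. Function names and the agreement criterion are assumptions for illustration.

```python
import numpy as np

def reliability(embeddings, labels, k_min=11, k_max=51):
    """Hypothetical per-sample reliability sketch: for each training
    sample, scan neighborhood sizes k_min..k_max and keep the highest
    rate of nearest neighbors (in embedding space) sharing its label."""
    n = len(labels)
    # Pairwise Euclidean distances; exclude self-matches via the diagonal.
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    order = np.argsort(dists, axis=1)
    scores = np.zeros(n)
    for i in range(n):
        best = 0.0
        for k in range(k_min, min(k_max, n - 1) + 1, 2):
            agreement = np.mean(labels[order[i, :k]] == labels[i])
            best = max(best, agreement)
        scores[i] = best
    return scores

def weighted_knn_predict(query, embeddings, labels, weights, k=11):
    """Classify a query by a reliability-weighted vote among its
    k nearest training neighbors."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    classes = np.unique(labels)
    votes = [(weights[nearest] * (labels[nearest] == c)).sum() for c in classes]
    return classes[int(np.argmax(votes))]
```

In this reading, a mislabeled sample disagrees with most of its embedding-space neighbors, receives a low weight, and therefore contributes little to downstream votes, which is consistent with the robustness-to-noise results summarized in the table above.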