Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning Causal Transition Matrix for Instance-dependent Label Noise

Authors: Jiahui Li, Tai-Wei Chang, Kun Kuang, Ximing Li, Long Chen, Jun Zhou

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We carry out experiments on both synthetic and real-world datasets, encompassing various types of label noise. We place particular emphasis on instance-dependent noise, as it represents the most challenging and significant aspect of our research. Experiment Setup. Datasets. (i) We first perform experiments on the manually corrupted version of four synthetic datasets, i.e., Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), SVHN (Yuval 2011), CIFAR10, CIFAR100 (Krizhevsky, Hinton et al. 2009). The experiments are conducted with three different types of artificial label noise... (ii) Meanwhile, we perform experiments on two real-world noisy datasets: Food101 (Bossard, Guillaumin, and Van Gool 2014) and Clothing1M (Xiao et al. 2015). Evaluation metric. The test sets for the various datasets remain clean, ensuring that the test accuracy can effectively reflect the superior performance of the denoising methods we evaluate. For the synthetic datasets, we report the mean performance across 5 random seeds. For the real-world dataset, we report the best results for the last 10 epochs. Baselines. We choose popular denoising methods as baselines... Ablation study. We also performed ablations on all scenarios... Results on Symmetric and Asymmetric Noise... Results on Instance-dependent Noise... Results on Real-world Noisy Dataset
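The "manually corrupted" synthetic setup can be illustrated with a short sketch. This is not the authors' code; it shows the standard way a class-conditional transition matrix T (with T[i][j] = P(noisy label = j | clean label = i)) is used to inject symmetric label noise, with hypothetical helper names:

```python
import numpy as np

def symmetric_noise_matrix(num_classes: int, noise_rate: float) -> np.ndarray:
    """Symmetric (uniform) noise: each label flips to any other class
    with equal probability, keeping its true class with prob 1 - noise_rate."""
    T = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def corrupt_labels(labels: np.ndarray, T: np.ndarray, seed: int = 0) -> np.ndarray:
    """Resample each label from the row of T indexed by its clean class."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])

clean = np.array([0, 1, 2, 3] * 250)               # 1000 toy labels, 4 classes
T = symmetric_noise_matrix(num_classes=4, noise_rate=0.2)
noisy = corrupt_labels(clean, T)
print((noisy != clean).mean())                      # empirical noise rate, near 0.2
```

Asymmetric and instance-dependent noise follow the same pattern but replace the uniform off-diagonal entries with class-pair-specific or per-instance flip probabilities.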
Researcher Affiliation | Collaboration | Jiahui Li 1,2*, Tai-Wei Chang 2*, Kun Kuang 1, Ximing Li 2, Long Chen 3, Jun Zhou 2. 1 Zhejiang University, 2 Ant Group, 3 The Hong Kong University of Science and Technology (author email addresses redacted).
Pseudocode | No | The paper describes the training framework in a block diagram (Figure 2(b)) and explains the steps in paragraph text, but does not present a structured pseudocode or algorithm block.
Open Source Code | No | For further details on the implementation and experimental results, please refer to Li et al. (2024).
Open Datasets | Yes | Datasets. (i) We first perform experiments on the manually corrupted version of four synthetic datasets, i.e., Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), SVHN (Yuval 2011), CIFAR10, CIFAR100 (Krizhevsky, Hinton et al. 2009)... (ii) Meanwhile, we perform experiments on two real-world noisy datasets: Food101 (Bossard, Guillaumin, and Van Gool 2014) and Clothing1M (Xiao et al. 2015).
Dataset Splits | No | The test sets for the various datasets remain clean, ensuring that the test accuracy can effectively reflect the superior performance of the denoising methods we evaluate. For the synthetic datasets, we report the mean performance across 5 random seeds. For the real-world dataset, we report the best results for the last 10 epochs. The paper implies standard train/test splits for the datasets but does not provide specific percentages, sample counts, or explicit details about how these splits were performed beyond the test sets being clean.
Hardware Specification | No | The paper mentions using ResNet models of varying depths (ResNet18, ResNet34, ResNet50) as backbone classifiers, but does not specify any hardware details such as the GPU model, CPU, or memory used for training or inference.
Software Dependencies | No | The paper mentions several techniques and models (e.g., co-teaching, the Gumbel-Softmax function, ResNet) and references past works, but it does not specify any software names with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions).
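The response mentions the Gumbel-Softmax function without naming a library or version. As a generic point of reference (not the paper's implementation), the relaxation itself can be sketched in plain NumPy:

```python
import numpy as np

def gumbel_softmax(logits: np.ndarray, tau: float = 1.0, seed: int = 0) -> np.ndarray:
    """One sample from the Gumbel-Softmax relaxation of a categorical
    distribution: perturb logits with Gumbel(0, 1) noise, then apply a
    temperature-scaled softmax (Jang et al. 2017)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())                 # numerically stable softmax
    return e / e.sum()

sample = gumbel_softmax(np.array([2.0, 0.5, 0.1]), tau=0.5)
print(sample.round(3))                      # a probability vector summing to 1
```

As tau decreases toward 0, samples approach one-hot vectors, which is why the relaxation is commonly used to backpropagate through discrete choices.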
Experiment Setup | Yes | In every scenario, we assign α1 = α3 = 0.1. The value for α2 varies according to scenario complexity. For datasets lacking extra perturbations on the instances, α2 is set to 0.1. Conversely, when datasets contain perturbations, α2 is adjusted to 0.01. Moreover, β is consistently set to 0.2 across all experiments.
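The quoted settings give the weights but not the terms they scale; the excerpt does not identify the loss components. Assuming (hypothetically) that α1-α3 and β weight auxiliary terms added to a main objective, the configuration could be captured as follows, with placeholder names throughout:

```python
# Placeholder names: the actual loss terms are not identified in the excerpt.
def total_loss(main, aux1, aux2, aux3, reg,
               alpha1=0.1, alpha2=0.1, alpha3=0.1, beta=0.2):
    """Weighted sum of a main loss and hypothetical auxiliary terms."""
    return main + alpha1 * aux1 + alpha2 * aux2 + alpha3 * aux3 + beta * reg

def alpha2_for(instances_perturbed: bool) -> float:
    """Per the quoted setup: α2 = 0.1 without extra instance
    perturbations, 0.01 when the dataset contains perturbations."""
    return 0.01 if instances_perturbed else 0.1

print(round(total_loss(1.0, 0.5, 0.5, 0.5, 0.5,
                       alpha2=alpha2_for(False)), 6))   # 1.25
```

The only setting the source fixes conditionally is α2; everything else (α1 = α3 = 0.1, β = 0.2) is constant across scenarios.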