Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization

Authors: Kuan Zhang, Chengliang Chai, Jingzhe Xu, Chi Zhang, Han Han, Ye Yuan, Guoren Wang, Lei Cao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on six synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in storage and computational time, strongly improving model scalability. Our code is available at https://github.com/iTheresaApocalypse/IDO.
Researcher Affiliation	Academia	¹Beijing Institute of Technology, ²University of Arizona, EMAIL
Pseudocode	No	The paper describes its methodology using prose, mathematical equations (e.g., in Sections 3.2, 3.3, and 4.2), and figures (e.g., Figure 3 illustrating the framework). However, there are no explicitly labeled pseudocode or algorithm blocks presenting structured steps.
Open Source Code	Yes	Our code is available at https://github.com/iTheresaApocalypse/IDO.
Open Datasets	Yes	We begin by evaluating the performance of IDO on three popular image classification benchmarks (CIFAR-10, CIFAR-100 [41] and Tiny-ImageNet [42]) using synthetic datasets with varying types and ratios of noisy labels. ... We further investigate the performance of IDO on three real-world noisy label datasets: 1) CIFAR-100N [44] ... 2) Clothing1M [45] ... 3) WebVision [1] ...
Dataset Splits	Yes	For synthetic datasets, we added noise to the entire training set and used the test set to evaluate performance. CIFAR100 consists of 100 classes, with 50,000 training images and 10,000 test images, and we set the batch size to 128. Tiny-ImageNet contains 200 classes, with 100,000 training images and 10,000 test images, and we also set the batch size to 128.
Hardware Specification	Yes	All experiment results are the averages of five random runs on a single A100 80G GPU. ... Table 3: Comparison of methods scalability, running DMix, UNICON and IDO on CIFAR100 Inst. 40% noise with larger models on the A100 80GB GPU with a batch size of 64.
Software Dependencies	No	In our experiments, we primarily utilized pre-trained ResNet-50, ViT-16/B and ConvNeXt-B models, both of which were obtained by calling the PyTorch timm library. The paper mentions software components like 'PyTorch timm library' but does not specify exact version numbers for PyTorch, timm, or other dependencies.
Experiment Setup	Yes	We run 5 epochs for stage one to obtain the prior knowledge about wrong event for each sample, and run 10 epochs for stage two to fully robust train the pre-trained model. ... For CIFAR100 consists of 100 classes, with 50,000 training images and 10,000 test images, and we set the batch size to 128. ... Clothing1M is a class-imbalanced dataset, and we sample class-balanced subsets each time, with a batch size of 64 and 1,000 iterations. ... Appendix B includes Table 9: Optimizer configurations for different models and stages, detailing Optimizer (SGD, AdamW), Learning Rate, Weight Decay, and Scheduler (Cosine, No).
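
The Dataset Splits row states that, for the synthetic benchmarks, noise was added to the entire training set while the test set was kept for evaluation. As a rough illustration of what such a setup can look like, the sketch below injects symmetric label noise into a list of training labels. It is an assumption-laden example, not the authors' implementation: the function name `add_symmetric_noise`, the uniform flip to a different class, and the seed handling are all hypothetical, and the paper evaluates several noise types that are not reproduced here.

```python
import random


def add_symmetric_noise(labels, num_classes, noise_ratio, seed=0):
    """Return a copy of `labels` with a fixed fraction flipped.

    Hypothetical helper: each selected label is replaced by a different
    class drawn uniformly at random (symmetric noise).
    """
    rng = random.Random(seed)
    noisy = list(labels)
    # Number of training labels to corrupt.
    n_flip = int(round(noise_ratio * len(noisy)))
    # Pick distinct positions so exactly n_flip labels change.
    flip_idx = rng.sample(range(len(noisy)), n_flip)
    for i in flip_idx:
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy


# Toy training labels: 1,000 samples over 10 classes (not a real dataset).
clean = [i % 10 for i in range(1000)]
noisy = add_symmetric_noise(clean, num_classes=10, noise_ratio=0.4)
changed = sum(c != n for c, n in zip(clean, noisy))
print(changed / len(clean))
```

Only the training labels are altered in this sketch; test labels would be left untouched, which matches the evaluation protocol quoted in the table.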