Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Subclass-Dominant Label Noise: A Counterexample for the Success of Early Stopping

Authors: Yingbin Bai, Zhongyi Han, Erkun Yang, Jun Yu, Bo Han, Dadong Wang, Tongliang Liu

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments demonstrate that NoiseCluster outperforms state-of-the-art baselines on both synthetic and real-world datasets, highlighting the importance of addressing SDN in learning with noisy labels. The code is available at https://github.com/tmllab/2023_NeurIPS_SDN.
Researcher Affiliation Collaboration Yingbin Bai1 Zhongyi Han2 Erkun Yang3 Jun Yu4 Bo Han5 Dadong Wang6 Tongliang Liu1 1Sydney AI Centre, University of Sydney; 2Mohamed bin Zayed University of Artificial Intelligence; 3Xidian University; 4University of Science and Technology of China; 5Hong Kong Baptist University; 6CSIRO
Pseudocode Yes Algorithm 1: NoiseCluster. Input: Network fθ; Final layer fξ; Noisy training dataset D̃(X, Ỹ); Number of epochs for long-trained N; Class number C; DBSCAN Eps and MinPts. for i = 1, ..., N do Train fθ and fξ on D̃(X, Ỹ); // standard training. Ẑ ← t-SNE(fθ(X)); for c ← 1 to C do U^c_K ← DBSCAN(Ẑ^c, Eps, MinPts); // identify SDN. for U^c_k in U^c_K do if U^c_k = largest then Compute set distance with Eq. (1); Update Ỹ in U^c_k with Eq. (2); D_l ← D_l ∪ U^c_k; Continually train fθ and fξ on D_l for the rest of the epochs.
Open Source Code Yes The code is available at https://github.com/tmllab/2023_NeurIPS_SDN.
Open Datasets Yes To facilitate research on SDN, we introduce CIFAR20-SDN, a representative SDN dataset built from CIFAR-100, which provides 20 class labels and 100 subclass labels.
Dataset Splits Yes In experiments without SSL, we reserve 10% of the training data as the validation set, while we utilize the entire training data for experiments with SSL.
Hardware Specification Yes All methods run on a four-core CPU and a single Nvidia V100.
Software Dependencies No No specific software dependencies with version numbers are provided.
Experiment Setup Yes For CIFAR20-SDN, we employ ResNet-34 [19] for experiments without SSL and PreAct ResNet-18 [20] for experiments with SSL. During optimization, we train the model for 300 epochs, using a learning rate of 2 × 10⁻², a single cycle of cosine annealing [37], a momentum of 0.9, and a weight decay of 5 × 10⁻⁴. We utilize a batch size of 128 and a stopping epoch of 80, with a Close Point value of 20. For DBSCAN hyperparameters, Eps and MinPts are set to 0.02 and 100, respectively.
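
The training configuration quoted in the Dataset Splits and Experiment Setup rows can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the hyperparameter values come from the rows above, while the helper names (cosine_lr, train_val_split), the learning-rate floor of zero, and the split seed are assumptions made only for this example. The 50,000 figure is the size of the standard CIFAR-100 training set, from which CIFAR20-SDN is built.

```python
import math
import random

# Values quoted in the Experiment Setup and Dataset Splits rows.
EPOCHS = 300
BASE_LR = 2e-2
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
BATCH_SIZE = 128
VAL_FRACTION = 0.10

# Assumptions for this sketch; neither value is stated in the rows above.
MIN_LR = 0.0      # assumed floor of the cosine schedule
SPLIT_SEED = 0    # hypothetical seed for the validation split


def cosine_lr(epoch, total_epochs=EPOCHS, base_lr=BASE_LR, min_lr=MIN_LR):
    """Learning rate at a 0-indexed epoch for one cosine cycle, no restarts."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def train_val_split(num_samples, val_fraction=VAL_FRACTION, seed=SPLIT_SEED):
    """Shuffle sample indices and hold out a fraction of them for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(round(num_samples * val_fraction))
    # First num_val shuffled indices form the validation set, the rest train.
    return indices[num_val:], indices[:num_val]


if __name__ == "__main__":
    # The standard CIFAR-100 training set has 50,000 images.
    train_idx, val_idx = train_val_split(50_000)
    print("train/val sizes:", len(train_idx), len(val_idx))
    for epoch in (0, 80, 150, 299):
        print("epoch", epoch, "lr", cosine_lr(epoch))
```

With these settings the schedule starts at the quoted learning rate and decays smoothly toward the assumed floor over the 300 epochs. The split applies only to the experiments without SSL, since the SSL experiments are reported to use the entire training set.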