Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MaxSup: Overcoming Representation Collapse in Label Smoothing

Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. We perform a logit-level analysis of Label Smoothing, revealing how the error amplification term inflates misclassification confidence and compresses features. We demonstrate superior performance across tasks and architectures, including ResNet, MobileNetV2, and DeiT-S, where MaxSup significantly boosts accuracy on ImageNet and consistently delivers stronger representations for downstream tasks such as semantic segmentation and robust transfer learning.
Researcher Affiliation | Collaboration | α University of Mannheim; γ University of Washington; ϵ Meta AI; β CISPA Helmholtz Center for Information Security; δ Max Planck Institute for Informatics
Pseudocode | Yes | Algorithm 1: Gradient Descent with Max Suppression (MaxSup)
Open Source Code | Yes | https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
Open Datasets | Yes | Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. Through comprehensive experiments in both image classification (Section 4.2) and semantic segmentation (Section 4.3), we show that MaxSup not only alleviates severe intra-class collapse but also consistently boosts top-1 accuracy and robustly enhances downstream transfer performance (Section 4.1). MaxSup achieves 76.12% accuracy, clearly and consistently outperforming LS. This result underscores that MaxSup directly tackles LS's fundamental shortcoming by maintaining a consistent and meaningful regularization signal even when the top-1 prediction is incorrect. Table 1: Ablation on LS components using DeiT-Small on ImageNet-1K (without CutMix or Mixup). Table 2: Feature quality of ResNet-50 on ImageNet-1K. Table 3: Linear-probe transfer accuracy on CIFAR-10 (higher is better). We further evaluate MaxSup on two fine-grained visual recognition tasks: CUB-200-2011 [37] and Stanford Cars [16]. We performed experiments on the CIFAR-10-LT dataset with imbalance ratios of 50 and 100, following the experimental settings described in [35]. We also conducted experiments on CIFAR10-C benchmark [12] shown in Table 8 following settings in [11]. We evaluate its performance on semantic segmentation using the widely adopted MMSegmentation framework. Specifically, we adopt the UperNet [40] architecture with a DeiT-Small backbone, trained on ADE20K.
Dataset Splits | Yes | Table 1: Ablation on LS components using DeiT-Small on ImageNet-1K (without CutMix or Mixup). Table 2: Feature quality of ResNet-50 on ImageNet-1K. These benefits are further underscored by the linear-probe transfer accuracy on CIFAR-10 (Table 3). We performed experiments on the CIFAR-10-LT dataset with imbalance ratios of 50 and 100, following the experimental settings described in [35]. To evaluate the effectiveness of MaxSup on out-of-distribution (OOD) settings, we also conducted experiments on CIFAR10-C benchmark [12] shown in Table 8 following settings in [11]. We evaluate its performance on semantic segmentation using the widely adopted MMSegmentation framework. Specifically, we adopt the UperNet [40] architecture with a DeiT-Small backbone, trained on ADE20K.
Hardware Specification | No | The text or appendix indicates GPU usage (e.g., ResNet on cluster GPUs), approximate training duration, and other relevant details. Though high-level, it suffices to gauge feasibility.
Software Dependencies | No | We further investigate MaxSup's applicability to downstream tasks by evaluating its performance on semantic segmentation using the widely adopted MMSegmentation framework.
Experiment Setup | Yes | For the ResNet Series, we train for 200 epochs using stochastic gradient descent (SGD) with momentum 0.9, weight decay of 1 × 10^-4, and a batch size of 2048. The initial learning rate is 0.85 and is annealed via a cosine schedule. We also test ResNet variants on CIFAR-100 with a conventional setup: an initial learning rate of 0.1 (reduced fivefold at epochs 60, 120, and 160), training for 200 epochs with batch size 128 and weight decay 5 × 10^-4. For DeiT-Small, we use the official codebase [36], training from scratch without knowledge distillation to isolate MaxSup's contribution. CutMix and Mixup are disabled to ensure the model optimization objective remains unchanged.
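
The Pseudocode and Open Datasets rows refer to a regularization signal that stays meaningful even when the top-1 prediction is incorrect, but this page does not reproduce the loss itself. The sketch below is one plausible reading, not the authors' implementation. It relies on the standard identity that cross-entropy against smoothed labels equals plain cross-entropy plus alpha * (z_target - mean(z)), and it assumes that MaxSup swaps the target logit for the largest logit in that penalty. The function names and the default alpha = 0.1 are illustrative choices.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    """Cross-entropy against a hard (one-hot) label."""
    return -log_softmax(logits)[target]


def label_smoothing_loss(logits, target, alpha=0.1):
    """Cross-entropy against soft labels (1 - alpha) * one_hot + alpha / K."""
    k = len(logits)
    logp = log_softmax(logits)
    soft = [alpha / k + (1.0 - alpha) * (i == target) for i in range(k)]
    return -sum(s * lp for s, lp in zip(soft, logp))


def ls_penalty(logits, target, alpha=0.1):
    """Label Smoothing's extra term: penalizes the ground-truth logit."""
    mean = sum(logits) / len(logits)
    return alpha * (logits[target] - mean)


def max_sup_loss(logits, target, alpha=0.1):
    """ASSUMED MaxSup form: penalize the top-1 logit instead of the target logit."""
    mean = sum(logits) / len(logits)
    return cross_entropy(logits, target) + alpha * (max(logits) - mean)
```

Under this assumption the two losses coincide whenever the target is already the top-1 class, and differ by alpha * (z_max - z_target) when the prediction is wrong, which is the case the quoted passage singles out.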
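
The Experiment Setup row describes two learning-rate schedules: a cosine schedule starting at 0.85 over 200 epochs, and a step schedule starting at 0.1 that is divided by five at epochs 60, 120, and 160. A minimal sketch of both follows. The starting rates, epoch counts, and milestones come from the quoted text; the per-epoch granularity, the absence of warmup, and annealing down to zero are assumptions, since the excerpt does not specify them. The function names are hypothetical.

```python
import math


def cosine_lr(epoch, base_lr=0.85, total_epochs=200, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs.

    min_lr=0.0 and the lack of warmup are assumptions, not stated in the text.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def step_lr(epoch, base_lr=0.1, milestones=(60, 120, 160), factor=5.0):
    """Divide the learning rate by `factor` at each milestone epoch reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops


if __name__ == "__main__":
    # Print both schedules at a few epochs for a quick visual check.
    for e in (0, 60, 100, 120, 160, 199):
        print(e, round(cosine_lr(e), 5), step_lr(e))
```

In a PyTorch training loop the same shapes are usually obtained from the built-in schedulers rather than hand-written functions; the plain-Python version is used here only to keep the arithmetic visible.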