Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Robustness and Accuracy Could Be Reconcilable by (Proper) Definition
Authors: Tianyu Pang, Min Lin, Xiao Yang, Jun Zhu, Shuicheng Yan
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sec. 5, we validate the effectiveness of replacing KL divergence with distance-based metrics (and their variants), developed from the analyses of SCORE. We improve the state-of-the-art AT methods under AutoAttack (Croce and Hein, 2020), and achieve top-rank performance with 1M DDPM generated data on the leaderboards of CIFAR-10 and CIFAR-100 on RobustBench (Croce et al., 2020). |
| Researcher Affiliation | Collaboration | ¹Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint Center for ML, Tsinghua University. ²Sea AI Lab, Singapore. |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Code is at https://github.com/P2333/SCORE. |
| Open Datasets | Yes | We improve the state-of-the-art AT methods under AutoAttack (Croce and Hein, 2020), and achieve top-rank performance with 1M DDPM generated data on the leaderboards of CIFAR-10 and CIFAR-100 on RobustBench (Croce et al., 2020). |
| Dataset Splits | Yes | For our methods, we report the results on the checkpoint with the highest value of PGD-10 (SE) accuracy on a separate validation set, similarly to Rice et al. (2020). |
| Hardware Specification | No | The paper mentions using 'large models' and notes 'limited computational resources' but does not provide specific hardware details such as GPU or CPU models used for experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation' and 'SGD momentum optimizer' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In training, we use SGD momentum optimizer with batch size 128 and weight decay 5e-4. We exploit the PGD-AT (Madry et al., 2018) and TRADES (Zhang et al., 2019) frameworks. The training attack used is 10-steps PGD with step size α = 2/255 for ℓ∞ threat model and α = 16/255 for ℓ2 threat model. The training runs for 110 epochs with the learning rate decaying by a factor of 0.1 at the 100 and 105 epoch, respectively. The hyperparameter β = 6 in the TRADES experiments. |
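
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, together with the step-decay learning rate schedule they imply. This is a minimal sketch, not the authors' code: the names `TrainConfig` and `learning_rate` are hypothetical, the initial learning rate is not stated in the row above (the `0.1` below is a placeholder assumption), and epochs are assumed to be 1-indexed with the decay taking effect from the milestone epoch onward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    batch_size: int = 128
    weight_decay: float = 5e-4
    epochs: int = 110
    lr_milestones: tuple = (100, 105)   # epochs at which the rate decays
    lr_decay: float = 0.1               # multiplicative decay factor
    pgd_steps: int = 10                 # training attack: 10-step PGD
    pgd_step_size_linf: float = 2 / 255   # alpha for the l_inf threat model
    pgd_step_size_l2: float = 16 / 255    # alpha for the l_2 threat model
    trades_beta: float = 6.0            # used in the TRADES experiments
    # ASSUMPTION: the initial learning rate is not given in the row above;
    # 0.1 is a placeholder, not a value taken from the paper.
    base_lr: float = 0.1


def learning_rate(epoch: int, cfg: TrainConfig) -> float:
    """Return the learning rate for a 1-indexed epoch under step decay."""
    # ASSUMPTION: the decay applies from the milestone epoch onward.
    passed = sum(1 for m in cfg.lr_milestones if epoch >= m)
    return cfg.base_lr * cfg.lr_decay ** passed


if __name__ == "__main__":
    cfg = TrainConfig()
    for epoch in (1, 99, 100, 104, 105, cfg.epochs):
        print(epoch, learning_rate(epoch, cfg))
```

Under these assumptions the rate stays at its initial value through epoch 99, drops by a factor of 10 at epoch 100, and by a further factor of 10 at epoch 105. The perturbation budget ε and the momentum coefficient are not stated in the row, so they are left out rather than guessed.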