Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On Using Certified Training towards Empirical Robustness
Authors: Alessandro De Palma, Serge Durand, Zakaria Chihani, François Terrier, Caterina Urban
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Inspired by recent developments in certified training, which rely on a combination of adversarial attacks with network over-approximations, and by the connections between local linearity and catastrophic overfitting, we present experimental evidence on the practical utility and limitations of using certified training towards empirical robustness. We show that, when tuned for the purpose, a recent certified training algorithm can prevent catastrophic overfitting on single-step attacks, and that it can bridge the gap to multi-step baselines under appropriate experimental settings. Finally, we present a conceptually simple regularizer for network over-approximations that can achieve similar effects while markedly reducing runtime. (...) We present a comprehensive empirical study on the applicability of recent certified training techniques towards empirical robustness, leading to the following contributions: (...) 5 Experimental Study |
| Researcher Affiliation | Collaboration | Alessandro De Palma (Inria, École Normale Supérieure, PSL University, CNRS); Serge Durand (Inria, École Normale Supérieure, PSL University, CNRS; Université Paris-Saclay, CEA, List); Zakaria Chihani (Université Paris-Saclay, CEA, List); François Terrier (Université Paris-Saclay, CEA, List); Caterina Urban (Inria, École Normale Supérieure, PSL University, CNRS) |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. It provides mathematical formulations and descriptions of procedures in paragraph text. |
| Open Source Code | Yes | Code is available at https://github.com/sergedurand/CertifiedTraining4EmpiricalRobustness. |
| Open Datasets | Yes | We focus on three standard 32×32 image classification datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), and SVHN (Netzer et al., 2011). |
| Dataset Splits | Yes | CIFAR-10 and CIFAR-100 consist of 60,000 32×32 RGB images, with 50,000 images for training and 10,000 for testing. CIFAR-10 contains 10 classes, while CIFAR-100 contains 100 classes. SVHN consists of 32×32 RGB images, with 73,257 images for training and 26,032 for testing. (...) Unless specified otherwise, for tuning purposes or when reporting validation results we use a random 20% holdout of the training set as validation set, and train on the remaining 80%. After tuning and when reporting test set results, we use the standard train and test splits for all datasets. |
| Hardware Specification | Yes | All timing measurements were carried out on an Nvidia GTX 1080Ti GPU, using 6 cores of an Intel Skylake Xeon 5118 CPU. All the other experiments were run on a single GPU each, allocated from two separate Slurm-based internal clusters. We used the following GPU models from one cluster: Nvidia V100, Nvidia RTX6000, Nvidia RTX8000, Nvidia GTX 1080Ti, Nvidia RTX2080Ti. And the following GPU models from the other cluster: Nvidia Quadro P5000, Nvidia H100. |
| Software Dependencies | No | Our implementation relies on PyTorch (Paszke et al., 2019) and on the public codebases from de Jorge et al. (2022); Rocamora et al. (2024); De Palma et al. (2024b). We compute IBP bounds using the auto_LiRPA implementation (Xu et al., 2020). The paper mentions PyTorch and auto_LiRPA but does not provide specific version numbers for these software dependencies, nor for Python. |
| Experiment Setup | Yes | The batch size is set to 128, and SGD with weight decay of 5×10⁻⁴ is used for the optimization. (...) On CIFAR-10 and CIFAR-100 we train a PreActResNet18 for 30 epochs with a cyclic learning rate linearly increasing from 0 to 0.2 during the first half of the training, then decreasing back to 0. On SVHN the training is done for 15 epochs, with a cyclic learning rate linearly increasing from 0 to 0.05 during 6 epochs, then decreasing back to 0 for the remaining 9 epochs. Furthermore, for SVHN only, the attack perturbation radius is ramped up from 0 to ϵ during the first 5 epochs. (...) The long training schedule used for the experiments in Table 1 mirrors a setup from Shi et al. (2021), widely adopted in the certified training literature (Müller et al., 2023; Mao et al., 2023; De Palma et al., 2024b). Training is carried out for 160 epochs using the Adam optimizer with a learning rate of 5×10⁻⁴, decayed twice by a factor of 0.2 at epochs 120 and 140. Gradient clipping is employed, with the maximal ℓ2 norm of gradients equal to 10. |
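The cyclic learning-rate schedules and the SVHN-only ϵ ramp quoted above can be sketched as simple epoch-indexed functions. This is a minimal illustration, not the authors' implementation: the function names `cyclic_lr` and `eps_ramp` are hypothetical, and whether the paper updates the rate per epoch or per optimizer step is not specified in the quoted text.

```python
def cyclic_lr(step, up_steps, down_steps, peak_lr):
    # Triangular cyclic schedule: linear warm-up from 0 to peak_lr
    # over up_steps, then linear decay back to 0 over down_steps.
    if step <= up_steps:
        return peak_lr * step / up_steps
    return peak_lr * max(0.0, up_steps + down_steps - step) / down_steps

def eps_ramp(epoch, ramp_epochs, eps):
    # SVHN-only schedule: attack radius grows linearly from 0 to eps
    # during the first ramp_epochs, then stays at eps.
    return eps * min(1.0, epoch / ramp_epochs)

# CIFAR-10/100: 30 epochs, peak 0.2 at the halfway point.
lr_cifar = [cyclic_lr(e, 15, 15, 0.2) for e in range(31)]
# SVHN: 15 epochs, peak 0.05 after 6 epochs, back to 0 after 9 more.
lr_svhn = [cyclic_lr(e, 6, 9, 0.05) for e in range(16)]
```

With these parameters, `lr_cifar` peaks at 0.2 at epoch 15 and returns to 0 at epoch 30, matching the quoted CIFAR schedule.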
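The random 20% validation holdout described in the Dataset Splits row can likewise be sketched as a seeded index split. This is an assumption-laden illustration (the paper does not state how the holdout was drawn or seeded); the helper name `holdout_split` is hypothetical.

```python
import random

def holdout_split(n_train, holdout_frac=0.2, seed=0):
    # Shuffle all training indices with a fixed seed, then reserve
    # holdout_frac of them as the validation set; the remaining
    # indices form the reduced training set.
    idx = list(range(n_train))
    random.Random(seed).shuffle(idx)
    n_val = int(n_train * holdout_frac)
    return sorted(idx[n_val:]), sorted(idx[:n_val])

# E.g. CIFAR-10/100: 50,000 training images -> 40,000 train / 10,000 val.
train_idx, val_idx = holdout_split(50000)
```

After tuning, the quoted setup discards this split and reports test results on the standard train/test partitions.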