Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Paradox of Certified Training

Authors: Nikola Jovanović, Mislav Balunović, Maximilian Baader, Martin Vechev

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental evaluation with a number of popular convex relaxations provides strong evidence that these factors can explain the drop in certified robustness observed for tighter relaxations. We also systematically explore modifications of existing relaxations and discover that improving unfavorable properties is challenging, as such attempts often harm other properties, revealing a complex tradeoff.
Researcher Affiliation | Academia | Nikola Jovanović* (EMAIL), Department of Computer Science, ETH Zurich; Mislav Balunović* (EMAIL), Department of Computer Science, ETH Zurich; Maximilian Baader (EMAIL), Department of Computer Science, ETH Zurich; Martin Vechev (EMAIL), Department of Computer Science, ETH Zurich
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations (for example in the 'Backsubstitution' section and 'Appendix A: Backsubstitution Example'), but it does not contain a dedicated pseudocode block or an algorithm in a structured, code-like format.
Open Source Code | No | The paper does not contain an explicit statement by the authors about releasing their source code, nor does it provide a direct link to a code repository for the methodology described. It mentions a link to OpenReview for peer review, not code.
Open Datasets | Yes | Next, we perform a thorough evaluation of certified training with all relaxations introduced in Section 2 on 4 widely used datasets (MNIST, Fashion-MNIST, SVHN, CIFAR-10) and 2 architectures: FC, a 5-layer dense network, and CONV, a 3-layer convolutional network.
Dataset Splits | Yes | For MNIST, we tune all hyperparameters thoroughly. We train all models for 200 epochs, starting with a warm-up (Nw epochs) followed by a ramp-up period (Nr epochs) to stabilize the training procedure (Gowal et al., 2018). During the warm-up we train the network naturally. During the ramp-up we gradually increase the perturbation radius ϵ from 0 to ϵtrain, decrease κ from κstart = 1 to κend (shifting from natural to certified training), and for CROWN-IBP gradually shift from the CROWN-IBP (R) to the IBP loss. We use a batch size of 100 (50 for memory-intensive models) and train using the Adam optimizer with the initial learning rate α. Finally, we use L1 regularization with the strength hyperparameter λ. We tune (Nw, Nr, κend, λ, α), as well as the learning rate schedule (milestones, where we reduce the learning rate 10× at epochs 130 and 190, or steps, where we halve it every 20 epochs), and the choice of last-layer elision (where we elide the final layer h^L of the network with the specifications cy as in Gowal et al. (2018)). For each perturbation radius ϵtest ∈ {0.1, 0.2, 0.3}, we train with ϵtrain ∈ {0.1, 0.2, 0.3, 0.4} and report the best result. In Table 8 we show the best choice of hyperparameters for each model used in our evaluation (see Appendix F.4 for full results).
Hardware Specification | Yes | All runs use a single GeForce RTX 2080 Ti GPU.
Software Dependencies | No | The paper implies a standard deep learning stack (e.g., the Adam optimizer and a framework such as PyTorch), but it does not explicitly name its dependencies or provide version numbers for any software components.
Experiment Setup | Yes | (Same supporting passage as quoted under Dataset Splits above: the 200-epoch training procedure with warm-up and ramp-up phases, the Adam optimizer, batch size 100, L1 regularization, and the tuned hyperparameter choices reported in Table 8 and Appendix F.4.)
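The training procedure quoted above — a warm-up phase, then a linear ramp-up of the perturbation radius ϵ while κ interpolates from natural to certified loss — and the IBP-style bound propagation it builds on can be sketched as follows. This is a minimal, hypothetical re-implementation for illustration only, not the authors' code; the function names and the linear shape of the ramp are assumptions.

```python
def ramp_schedule(epoch, n_warmup, n_ramp, eps_train,
                  kappa_start=1.0, kappa_end=0.5):
    """Return (eps, kappa) for the given epoch.

    During warm-up: eps = 0, kappa = kappa_start (pure natural training).
    During ramp-up: eps grows linearly 0 -> eps_train and kappa moves
    linearly kappa_start -> kappa_end (natural -> certified loss).
    """
    if epoch < n_warmup:
        return 0.0, kappa_start
    t = min(1.0, (epoch - n_warmup) / n_ramp)  # ramp progress in [0, 1]
    return t * eps_train, kappa_start + t * (kappa_end - kappa_start)


def ibp_linear(W, b, lo, hi):
    """Propagate the box [lo, hi] through y = W x + b with interval
    bound propagation (IBP): center c = (lo+hi)/2, radius r = (hi-lo)/2,
    output center W c + b, output radius |W| r."""
    c = [(l + h) / 2 for l, h in zip(lo, hi)]
    r = [(h - l) / 2 for l, h in zip(lo, hi)]
    out_c = [sum(w * ci for w, ci in zip(row, c)) + bi
             for row, bi in zip(W, b)]
    out_r = [sum(abs(w) * ri for w, ri in zip(row, r)) for row in W]
    return ([cc - rr for cc, rr in zip(out_c, out_r)],
            [cc + rr for cc, rr in zip(out_c, out_r)])
```

For example, with Nw = 10, Nr = 50, and ϵtrain = 0.3, the schedule keeps (ϵ, κ) at (0, 1) throughout warm-up and reaches (0.3, κend) once the ramp completes; `ibp_linear` applied to y1 = x1 − x2 and y2 = 2·x1 + 1 over x ∈ [0, 1]² bounds y1 in [−1, 1] and y2 in [1, 3].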