Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Paradox of Certified Training

Authors: Nikola Jovanović, Mislav Balunović, Maximilian Baader, Martin Vechev

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experimental evaluation with a number of popular convex relaxations provides strong evidence that these factors can explain the drop in certified robustness observed for tighter relaxations. We also systematically explore modifications of existing relaxations and discover that improving unfavorable properties is challenging, as such attempts often harm other properties, revealing a complex tradeoff.
Researcher Affiliation | Academia | Nikola Jovanović* (EMAIL), Department of Computer Science, ETH Zurich; Mislav Balunović* (EMAIL), Department of Computer Science, ETH Zurich; Maximilian Baader (EMAIL), Department of Computer Science, ETH Zurich; Martin Vechev (EMAIL), Department of Computer Science, ETH Zurich
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations (for example in the 'Backsubstitution' section and 'Appendix A: Backsubstitution Example'), but it does not contain a dedicated pseudocode block or an algorithm in a structured, code-like format.
Open Source Code | No | The paper does not contain an explicit statement by the authors about releasing their source code, nor does it provide a direct link to a code repository for the methodology described. It mentions a link to OpenReview for peer review, not code.
Open Datasets | Yes | Next, we perform a thorough evaluation of certified training with all relaxations introduced in Section 2 on 4 widely used datasets (MNIST, Fashion-MNIST, SVHN, CIFAR-10) and 2 architectures: FC, a 5-layer dense network, and CONV, a 3-layer convolutional network.
Dataset Splits | Yes | For MNIST, we tune all hyperparameters thoroughly. We train all models for 200 epochs, starting with a warm-up (Nw epochs) followed by a ramp-up period (Nr epochs) to stabilize the training procedure (Gowal et al., 2018). During the warm-up we train the network naturally. During the ramp-up we gradually increase the perturbation radius ϵ from 0 to ϵtrain, decrease κ from κstart = 1 to κend (shifting from natural to certified training), and for CROWN-IBP gradually shift from the CROWN-IBP (R) to the IBP loss. We use a batch size of 100 (50 for memory-intensive models) and train using the Adam optimizer with the initial learning rate α. Finally, we use L1 regularization with the strength hyperparameter λ. We tune (Nw, Nr, κend, λ, α), as well as the learning rate schedule (milestones, where we reduce the learning rate 10× at epochs 130 and 190, or steps, where we halve it every 20 epochs), and the choice of last-layer elision (where we elide the final layer h^L of the network with the specifications cy as in Gowal et al. (2018)). For each perturbation radius ϵtest ∈ {0.1, 0.2, 0.3}, we train with ϵtrain ∈ {0.1, 0.2, 0.3, 0.4} and report the best result. In Table 8 we show the best choice of hyperparameters for each model used in our evaluation (see Appendix F.4 for full results).
Hardware Specification | Yes | All runs use a single GeForce RTX 2080 Ti GPU.
Software Dependencies | No | The paper implies a standard deep learning stack (e.g., the Adam optimizer and a framework such as PyTorch), but it does not explicitly name its dependencies or provide version numbers for any software components.
Experiment Setup | Yes | (Same supporting passage as quoted under Dataset Splits above: the 200-epoch training procedure with warm-up and ramp-up phases, the Adam optimizer, batch size 100, L1 regularization, and the tuned hyperparameter choices reported in Table 8 and Appendix F.4.)
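The training procedure quoted above — a warm-up phase, then a linear ramp-up of the perturbation radius ϵ while κ interpolates from natural to certified loss — and the IBP-style bound propagation it builds on can be sketched as follows. This is a minimal, hypothetical re-implementation for illustration only, not the authors' code; the function names and the linear shape of the ramp are assumptions.

```python
def ramp_schedule(epoch, n_warmup, n_ramp, eps_train,
                  kappa_start=1.0, kappa_end=0.5):
    """Return (eps, kappa) for the given epoch.

    During warm-up: eps = 0, kappa = kappa_start (pure natural training).
    During ramp-up: eps grows linearly 0 -> eps_train and kappa moves
    linearly kappa_start -> kappa_end (natural -> certified loss).
    """
    if epoch < n_warmup:
        return 0.0, kappa_start
    t = min(1.0, (epoch - n_warmup) / n_ramp)  # ramp progress in [0, 1]
    return t * eps_train, kappa_start + t * (kappa_end - kappa_start)


def ibp_linear(W, b, lo, hi):
    """Propagate the box [lo, hi] through y = W x + b with interval
    bound propagation (IBP): center c = (lo+hi)/2, radius r = (hi-lo)/2,
    output center W c + b, output radius |W| r."""
    c = [(l + h) / 2 for l, h in zip(lo, hi)]
    r = [(h - l) / 2 for l, h in zip(lo, hi)]
    out_c = [sum(w * ci for w, ci in zip(row, c)) + bi
             for row, bi in zip(W, b)]
    out_r = [sum(abs(w) * ri for w, ri in zip(row, r)) for row in W]
    return ([cc - rr for cc, rr in zip(out_c, out_r)],
            [cc + rr for cc, rr in zip(out_c, out_r)])
```

For example, with Nw = 10, Nr = 50, and ϵtrain = 0.3, the schedule keeps (ϵ, κ) at (0, 1) throughout warm-up and reaches (0.3, κend) once the ramp completes; `ibp_linear` applied to y1 = x1 − x2 and y2 = 2·x1 + 1 over x ∈ [0, 1]² bounds y1 in [−1, 1] and y2 in [1, 3].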