DVERGE: Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles

Authors: Huanrui Yang, Jingyang Zhang, Hongliang Dong, Nathan Inkawhich, Andrew Gardner, Andrew Touchet, Wesley Wilkes, Heath Berry, Hai Li

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare DVERGE with various counterparts, including Baseline which trains an ensemble in a standard way and two previous robust ensemble training methods: ADP [12] and GAL [13]. For a fair comparison, we use ResNet-20 [27] as sub-models and average the output probabilities after the soft-max layer of each sub-model to yield the final predictions of ensembles. All the evaluations are performed on the CIFAR-10 dataset [28]."
Researcher Affiliation | Collaboration | "Huanrui Yang¹, Jingyang Zhang¹, Hongliang Dong¹, Nathan Inkawhich¹, Andrew Gardner², Andrew Touchet², Wesley Wilkes², Heath Berry², Hai Li¹. ¹Department of Electrical and Computer Engineering, Duke University; ²Radiance Technologies. ¹{huanrui.yang, jz288, hongliang.dong, nai2, hai.li}@duke.edu, ²{andrew.gardner, atouchet, Wesley.Wilkes, Heath.Berry}@radiancetech.com"
Pseudocode | Yes | "Algorithm 1 shows the pseudo-code for training an ensemble of N sub-models." (A hedged sketch of such a training loop is given after this table.)
Open Source Code | Yes | "The code of this work is available at https://github.com/zjysteven/DVERGE."
Open Datasets | Yes | "All the evaluations are performed on the CIFAR-10 dataset [28]."
Dataset Splits | No | The paper mentions training and testing on CIFAR-10 but does not explicitly describe a validation split.
Hardware Specification | No | The paper states only that experiments were run on NVIDIA GPUs; it does not give specific hardware details such as GPU/CPU models, clock speeds, or memory sizes.
Software Dependencies | No | The paper names its framework and optimizer ("We implement DVERGE with PyTorch [38] on NVIDIA GPUs with Adam optimizer [37].") but does not list version numbers or a full dependency specification.
Experiment Setup | Yes | "Training configuration details can be found in Appendix A. For DVERGE, we use PGD with momentum [29] to perform the feature distillation in Equation (1). We conduct 10 steps of gradient descent during feature distillation with a step size of ϵ/10. The ϵ used for each ensemble size to achieve the results in this section was empirically chosen for the highest diversity and lowest transferability, such that ϵ = 0.07, 0.05, 0.05 for ensembles with 3, 5, and 8 sub-models, respectively. [...] We use a batch size of 128. The learning rate is initialized as 0.001 and decays to 0.0001 after 200 epochs and further decays to 0.00001 after 300 epochs. The total training epoch is 350. For the PGD attack in feature distillation, we apply 10 steps with step size ϵ/10 and 5 random starts. The initial pretraining is done with 50 epochs." (A hedged sketch of this feature-distillation step is given after this table.)
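
The following is a minimal PyTorch sketch of the PGD-with-momentum feature distillation described in the Experiment Setup row, using the reported 10 steps and step size ϵ/10. The function name distill_feature, the momentum normalization, the single random start (the paper uses 5), and the feature_fn interface are illustrative assumptions, not the authors' implementation; the released code at https://github.com/zjysteven/DVERGE is authoritative.

```python
# Hypothetical helper, not the authors' code: approximate the distilled-feature
# objective (Equation (1)) with PGD-with-momentum, minimizing an L2 distance
# between intermediate features while staying in an L-inf ball around x_source.
import torch

def distill_feature(feature_fn, x_target, x_source, eps=0.05, steps=10, momentum=0.9):
    """Search for z with ||z - x_source||_inf <= eps whose features under
    `feature_fn` match those of x_target. Step size is eps/steps (eps/10 in the paper)."""
    step_size = eps / steps
    with torch.no_grad():
        feat_target = feature_fn(x_target)
    # Single random start for brevity; the paper reports 5 random starts.
    z = (x_source + torch.empty_like(x_source).uniform_(-eps, eps)).clamp(0, 1).detach()
    g = torch.zeros_like(z)  # momentum buffer
    for _ in range(steps):
        z.requires_grad_(True)
        loss = (feature_fn(z) - feat_target).pow(2).sum()
        grad, = torch.autograd.grad(loss, z)
        # MI-FGSM-style momentum accumulation with L1-normalized gradients (assumed).
        g = momentum * g + grad / grad.abs().mean().clamp_min(1e-12)
        z = z.detach() - step_size * g.sign()            # descend the distillation loss
        z = x_source + (z - x_source).clamp(-eps, eps)   # project back into the eps ball
        z = z.clamp(0, 1).detach()
    return z
```

Here `feature_fn` stands for a callable that returns a chosen intermediate-layer activation of one sub-model; it is a hypothetical interface, not a name from the paper.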
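
Building on the sketch above, the loop below illustrates how the round-robin training of N sub-models outlined by Algorithm 1 could be wired together with the reported Adam optimizer, batch size 128, and learning-rate schedule (0.001, decayed 10x at epochs 200 and 300, 350 epochs total). The cross-training objective, the per-batch layer choice, and the `feature_extractors` interface are assumptions based on the paper's description rather than a reproduction of the released code.

```python
# Hypothetical training-loop sketch (not the authors' Algorithm 1 verbatim):
# each sub-model is trained to classify the distilled inputs produced against
# the *other* sub-models, which is the vulnerability-diversification idea.
# Reuses the distill_feature helper sketched above.
import random
import torch
import torch.nn.functional as F

def train_dverge(sub_models, feature_extractors, train_loader, eps=0.05,
                 epochs=350, lr=1e-3, device="cuda"):
    # feature_extractors[i][l](x) is assumed to return the layer-l activations
    # of sub_models[i]; the paper pretrains the sub-models for 50 epochs first.
    optimizers = [torch.optim.Adam(m.parameters(), lr=lr) for m in sub_models]
    schedulers = [torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[200, 300], gamma=0.1)
                  for opt in optimizers]  # 1e-3 -> 1e-4 -> 1e-5, as reported
    num_layers = len(feature_extractors[0])
    for epoch in range(epochs):
        # Two passes over a shuffled loader stand in for independent (x, y) / (x_s, y_s) draws.
        for (x, _), (x_s, y_s) in zip(train_loader, train_loader):
            x, x_s, y_s = x.to(device), x_s.to(device), y_s.to(device)
            layer = random.randrange(num_layers)  # one layer drawn per batch (assumed)
            distilled = [distill_feature(feature_extractors[i][layer], x, x_s, eps=eps)
                         for i in range(len(sub_models))]
            for i, (model, opt) in enumerate(zip(sub_models, optimizers)):
                # Sub-model i learns the source labels on the other models' distilled inputs.
                loss = sum(F.cross_entropy(model(distilled[j]), y_s)
                           for j in range(len(sub_models)) if j != i)
                opt.zero_grad()
                loss.backward()
                opt.step()
        for sch in schedulers:
            sch.step()
```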