Improved OOD Generalization via Adversarial Training and Pre-training

Authors: Mingyang Yi, Lu Hou, Jiacheng Sun, Lifeng Shang, Xin Jiang, Qun Liu, Zhiming Ma

ICML 2021

Reproducibility (Variable | Result | LLM Response)
Research Type | Experimental | We conduct various experiments on both image classification (IC) and natural language understanding (NLU) tasks to verify our theoretical findings. For the IC task, we conduct AT on CIFAR10 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009), and then evaluate the OOD generalization of these models on the corrupted OOD data CIFAR10-C and ImageNet-C (Hendrycks & Dietterich, 2018). Empirical results on both IC and NLU tasks verify that AT improves OOD generalization. Table 1: Clean and corruption accuracy (%) of ResNet34 on CIFAR10-C and ImageNet-C using standard training and adversarial training under both ℓ2-norm and ℓ∞-norm.
Researcher Affiliation | Collaboration | 1University of Chinese Academy of Sciences, Beijing, China; 2Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China; 3Huawei Noah's Ark Lab, Shenzhen, China.
Pseudocode | Yes | Algorithm 1: Multi-Step SGD.
Input: number of training steps T, learning rates ηwt for model parameters and ηx for the adversarial input, two initialization points w1, δ1, norm p ∈ {2, ∞}, and perturbation size r. Return wT+1.
1: for t = 1, ..., T do
2:   Uniformly sample it from {1, ..., n}.
3:   for k = 1, ..., K do
4:     δk+1 = ProjBp(0,r)(δk + ηx ∇x f(wt, xit + δk)).
5:   end for
6:   wt+1 = wt − ηwt ∇w f(wt, xit + δK+1).
7: end for
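Algorithm 1 can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: it substitutes a hypothetical squared-loss linear model f(w, x) = 0.5(w·x − y)² for the network so the gradients are analytic, and fixes p = ∞ (projection onto the ℓ∞-ball is elementwise clipping).

```python
import numpy as np

def proj_linf(delta, r):
    # Projection onto the l-infinity ball B(0, r): elementwise clipping.
    return np.clip(delta, -r, r)

def multi_step_sgd(X, y, T=200, K=8, eta_w=0.05, eta_x=0.05, r=0.05, seed=0):
    """Sketch of Multi-Step SGD (Algorithm 1) for a toy linear model
    with squared loss f(w, x) = 0.5 * (w @ x - y)^2, using p = infinity.
    All hyperparameter names here are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                       # uniformly sample i_t
        x, yi = X[i], y[i]
        delta = np.zeros(d)                       # perturbation init
        for _ in range(K):                        # inner maximization: K PGD steps
            grad_x = (w @ (x + delta) - yi) * w   # df/dx at x + delta
            delta = proj_linf(delta + eta_x * grad_x, r)
        grad_w = (w @ (x + delta) - yi) * (x + delta)  # df/dw at adversarial input
        w = w - eta_w * grad_w                    # outer descent step
    return w
```

The inner loop ascends the loss in input space and projects back onto the perturbation ball; the outer loop descends in parameter space on the adversarially perturbed example, matching the structure of the pseudocode above.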
Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the methodology described in this paper is publicly available.
Open Datasets | Yes | We use the following benchmark datasets. CIFAR10 (Krizhevsky & Hinton, 2009) has 50000 colorful images as training samples from 10 object classes. ImageNet (Deng et al., 2009) contains colorful images with over 1 million training samples from 1,000 categories. SST-2 (Socher et al., 2013) and IMDb (Maas et al., 2011) are sentiment analysis datasets... STS-B consists of texts from different genres and sources... (Cer et al., 2017). MNLI is a textual entailment dataset... (Williams et al., 2018).
Dataset Splits | No | The paper describes using CIFAR10-C and ImageNet-C as OOD (out-of-distribution) data for evaluation, stating 'Each type of corruption has five levels of severity, and each severity has 10000 validation samples' for CIFAR10-C, which refers to evaluation samples, not a traditional validation split for hyperparameter tuning on the training dataset. It does not provide specific train/validation splits (e.g., 80/10/10 percentages or counts) for the primary training datasets (CIFAR10, ImageNet, etc.).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., specific GPU models, CPU models, or memory specifications).
Software Dependencies | No | The paper mentions software components like 'BERT' and 'AdamW' but does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | The number of inner loop steps K is 8 for CIFAR10, and 3 for ImageNet. The models are trained by SGD with momentum. The number of training epochs is 200 for CIFAR10, and 100 for ImageNet. The learning rate starts from 0.1 and decays by a factor of 0.2 at epochs 60, 120, 160 (resp. 30, 60, 90) for CIFAR10 (resp. ImageNet). Detailed hyperparameters are in Appendix C. The models are trained by AdamW (Loshchilov & Hutter, 2018) for 10 epochs. Detailed hyperparameters are in Appendix C.
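The step learning-rate schedule quoted above (base 0.1, decay factor 0.2 at fixed milestone epochs) can be expressed as a short helper. This is an illustrative sketch, not code from the paper; the function name `lr_at_epoch` is hypothetical.

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.2):
    """Step schedule matching the described CIFAR10 setup: start at
    base_lr and multiply by gamma at each milestone epoch.
    For ImageNet the milestones would instead be (30, 60, 90)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

So the CIFAR10 run trains at 0.1 for epochs 0-59, 0.02 for 60-119, 0.004 for 120-159, and 0.0008 from epoch 160 onward.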