Combating Exacerbated Heterogeneity for Robust Models in Federated Learning
Authors: Jianing Zhu, Jiangchao Yao, Tongliang Liu, Quanming Yao, Jianliang Xu, Bo Han
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we verify the rationality and effectiveness of SFAT on various benchmarked and real-world datasets with different adversarial training and federated optimization methods. The code is publicly available at: https://github.com/ZFancy/SFAT. ... We conduct extensive experiments to comprehensively understand the characteristics of the proposed SFAT (Section 5.1), as well as to verify its effectiveness on improving the model performance using several representative federated optimization methods (Section 5.2). |
| Researcher Affiliation | Academia | Jianing Zhu (1), Jiangchao Yao (2,3), Tongliang Liu (4), Quanming Yao (5), Jianliang Xu (1), Bo Han (1) — 1: Hong Kong Baptist University; 2: Shanghai Jiao Tong University; 3: Shanghai AI Laboratory; 4: Sydney AI Centre, The University of Sydney; 5: Tsinghua University |
| Pseudocode | Yes | To simplify practical use and adaptation, we can mainly assign higher weights to the clients having smaller adversarial training losses to realize the relative weighting illustrated in Figure 6. To be specific, we can set the normalized weight $P_k = \big(\tfrac{1+\alpha}{1-\alpha}\,\mathbf{1}(\tfrac{N_k}{N}L_k \le L_{\mathrm{sorted}}[\hat{K}]) + \mathbf{1}(\tfrac{N_k}{N}L_k > L_{\mathrm{sorted}}[\hat{K}])\big) \,/\, \big((\sum_{k=1}^{K} P_k) + \tfrac{2\alpha}{1-\alpha}\big)$ in the aggregation to ensure the expected lower bound [a hedged code sketch of this weighting appears after the table]. Based on the α-slack mechanism, we provide a new framework for combining adversarial training with federated learning [see the round-skeleton sketch after the table]. It is orthogonal to a variety of adversarial training methods (Zhang et al., 2019; Alayrac et al., 2019; Wang et al., 2020a; Jiang et al., 2020; Chen et al., 2021; Carmon et al., 2019; Madry et al., 2018; Chen et al., 2020; Ding et al., 2020; Li et al., 2021c; Chen et al., 2021; 2022) and federated optimization algorithms (McMahan et al., 2017; Li et al., 2018; 2021b; Kairouz et al., 2019) that pursue adversarial robustness or alleviate data heterogeneity and other client-side issues, as well as to work addressing the specific practical challenges of federated settings (Shah et al., 2021; Hong et al., 2021). All of these can be flexibly adopted into our framework or extended to fit other training constraints (Kairouz et al., 2019) of federated learning. E EXPERIMENTAL DETAILS AND MORE COMPREHENSIVE RESULTS In this section, we first provide the details of our experimental setups for the dataset, training, and evaluation. Then we provide more comprehensive results for better understanding the characteristics of our SFAT and for performance verification in different settings. In Appendix E.1, we trace and discuss the dynamics of our SFAT in choosing the upweighted clients. In Appendix E.2, we present the results of training with different local epochs. In Appendix E.3, we present and discuss the comparison of Re-SFAT vs. SFAT. In Appendices E.4 and E.6, we report the client drift and variance with the corresponding robust accuracy during training. In Appendix E.8, we discuss the orthogonal effects of SFAT on intensified heterogeneity. In Appendix E.9, we verify SFAT using more clients. In Appendix E.10, we verify SFAT using unequal data splits under the Non-IID setting. In Appendix E.11, we verify SFAT in a more practical real-world situation. In Appendix E.12, we report the performance results on both Non-IID and IID settings across the three benchmarked datasets. Dataset. We conduct the experiments on three benchmark datasets, i.e., SVHN (Netzer et al., 2011), CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), as well as a real-world dataset CelebA (Caldas et al., 2018), for federated adversarial training. For the IID scenario, we randomly distribute these datasets to each client. For simulating the Non-IID scenario, we follow McMahan et al. (2017); Shah et al. (2021) to distribute the training data based on their labels. To be specific, a skew parameter s is utilized in the data partition introduced by Shah et al. (2021), which enables the K clients to get a majority of their data samples from a subset of classes. We denote the set of all classes in a dataset as Y and create $Y_k$ by dividing all the class labels equally among the K clients. Accordingly, we split the data across the K clients such that each client has $(100 - (K-1)\cdot s)\%$ of the data for the classes in $Y_k$ and $s\%$ of the data from each of the other split sets [see the partition sketch after the table]. In most experiments, we set s = 2 to simulate the Non-IID partition with 5 clients, as Shah et al. (2021) recommended.
Table 6: Brief summary of the basic experimental details of SFAT — CIFAR-10: NIN (Shah et al., 2021), 10 local epochs, K = 5, $\hat{K}$ = 1; CIFAR-100: ResNet-18 (Chen et al., 2021), 3 local epochs, K = 20, $\hat{K}$ = 4; SVHN: SmallCNN (Zhang et al., 2019), 2 local epochs, K = 5, $\hat{K}$ = 1. Training and evaluation. In the experiments, we follow the previous works (Zhang et al., 2019; Shah et al., 2021) and use the same architectures, i.e., NIN (Lin et al., 2014) for CIFAR-10, ResNet-18 (He et al., 2016) for CIFAR-100, and SmallCNN (Zhang et al., 2019) for SVHN. For the local training batch size, we set 32 for CIFAR-10 and 128 for CIFAR-100 and SVHN. For the training schedule, SGD is adopted with 0.9 momentum for 100 communication rounds under 5 clients as in (Hong et al., 2021; Shah et al., 2021), with weight decay 0.0001. For adversarial training, we set the PGD configurations (Madry et al., 2018) per dataset. On CIFAR-10/CIFAR-100, we set the perturbation bound ϵ = 8/255 and the PGD step size 2/255; on SVHN, we set the perturbation bound ϵ = 4/255 and the PGD step size 1/255. The PGD generation for all datasets keeps the same step number of 10 [see the PGD sketch after the table]. Regarding evaluation, the accuracy on natural test data and on adversarial test data is computed following Madry et al. (2018); Zhang et al. (2019). Note that the adversarial test data are generated by the FGSM, PGD-20, and C&W (Carlini & Wagner, 2017) attacks with the same perturbation bound and step size as in training. All adversarial generations use a random start, i.e., a uniformly random perturbation in [−ϵ, ϵ] added to the natural data before the attack iterations. Besides, we also report the robustness under the stronger AutoAttack, termed AA for simplicity. All experiments are conducted multiple times using NVIDIA Tesla V100-SXM2 GPUs. As for our SFAT, different training tasks adopt different α-slack settings considering the characteristics of the local training data: specifically, we set α = 1/6 (i.e., (1+α)/(1−α) = 1.4) for the experiments on CIFAR-10, and α = 1/11 (i.e., (1+α)/(1−α) = 1.2) for the experiments on CIFAR-100 and SVHN. As for FedProx, we set its original hyper-parameter µ = 0.01 for each dataset, and the α for our α-slack mechanism is 1/11, 1/11, and 1/11 respectively. As for Scaffold, the α adopted for the above datasets is 1/11, 3/23, and 1/11 correspondingly. As for choosing the hyper-parameter α, one useful way is to progressively probe its effect in a value-growing manner: when α is very small, the objective approximately degenerates to the original FAT objective, and so does the performance, with no harm; slightly enlarging α can improve the performance owing to the benefit of alleviating the intensified heterogeneity during aggregation; one then stops at the point where the performance begins to drop. E.1 THE DYNAMICS OF SFAT In this part, we present the dynamics of our SFAT regarding the critical α-slack mechanism. To be specific, we visualize the clients selected in each communication round by our slack aggregation. In one experiment on CIFAR-10 (Non-IID), we trace the index of the top client (i.e., the selected client that is upweighted in our mechanism) and find that the top weight dynamically routes among different clients instead of a single client or a fixed subset of the 5 clients. The empirical results confirm that no dominant client exists during training.
To further check whether our SFAT results in unfair attention to different clients, we investigate the differences in the clients' training accuracy: the accuracy gap between the best and the worst client in FAT (32.40%) is comparable with that in SFAT (32.23%). Intuitively, since a large adversarial training loss can automatically balance the client-wise aggregation weights, there is no unfair attention exacerbating the performance difference among clients in our experiments, which is verified by the above gap and the similar weighting-index assignments. Figure 7: The index of the top-$\hat{K}$ clients with the smallest losses in SFAT (α = 1/6, $\hat{K}$ = 1) in each communication round on CIFAR-10 (Non-IID), as well as the total count. We can see that it dynamically routes among all clients instead of being fixed, and each client has similar assignments. Robust performance on each client. In Figure 7, we can see that no dominant client exists, which means that every client is at some point upweighted/downweighted during training. Thus, our SFAT helps all clients to become better. Besides, the global model is redistributed to each client in every communication round of federated learning, which also avoids bias accumulation for any single client. To verify per-client performance in our experiments, we report the robust training accuracy of each client and summarize it in Table 7. The results show that the training performance of each client is not worse than under the original FAT. Table 7: Robust training accuracy on each client (clients 1–5) w.r.t. different methods — CIFAR-10: FAT 59.35 / 65.97 / 75.96 / 77.12 / 80.37, SFAT 60.05 / 66.57 / 77.17 / 78.79 / 83.23; SVHN: FAT 89.50 / 82.12 / 89.60 / 81.96 / 87.88, SFAT 91.69 / 85.58 / 91.93 / 85.31 / 90.35. To further check the generalization performance on each client, we also conduct an extra experiment where we split off a small part of the local data in each client to serve as local test data, and evaluate the training and test performance of FAT and the proposed SFAT on the SVHN dataset. According to the summarized results in Table 8, the generalization performance (indicated by the gap between robust training and test accuracy) of our SFAT is also generally better than FAT on SVHN. Table 8: Robust performance gap on each client (clients 1–5) w.r.t. different methods; following the comparison above, the first block corresponds to FAT and the second to SFAT — FAT: Train 87.51 / 88.19 / 85.91 / 78.99 / 79.17, Test 72.96 / 70.64 / 64.30 / 61.97 / 66.57, Gap 14.55 / 17.55 / 21.61 / 17.02 / 12.65; SFAT: Train 87.59 / 88.41 / 85.97 / 79.27 / 79.04, Test 73.65 / 71.46 / 65.80 / 63.48 / 66.62, Gap 13.94 / 16.95 / 20.17 / 15.79 / 12.42. E.2 EXPERIMENTS WITH DIFFERENT LOCAL TRAINING EPOCHS In this part, we first report the robust test curves on the CIFAR-10 dataset to compare FAT and SFAT using different local training epochs in each client. Then we focus on an extreme setup, i.e., 1 local epoch per client, on both CIFAR-10 and SVHN and report the performance comparison, although using 1 local epoch is not practical considering the heavy communication cost it introduces in real-world applications (McMahan et al., 2017). In Figure 8, we conduct the experiments on CIFAR-10 with different local training epochs.
We find that FAT exhibits robustness deterioration across different settings, which impedes further progress towards adversarial robustness of the federated system, and simply changing the local training epochs cannot achieve significantly higher robust accuracy, as the dashed black line denotes. In comparison, our SFAT can consistently achieve higher robust accuracy than FAT by alleviating the deterioration. Even in the extreme setup, i.e., using 1 local epoch per client, the intensified heterogeneity still exists, since the adversarial generation is conducted in each optimization step, and it can still be mitigated by our SFAT. Reducing the local epochs can indeed alleviate the robustness deterioration since it reduces the original client drift (McMahan et al., 2017; Li et al., 2018). As the original client drift is alleviated, the adversarial generation inherits and exacerbates less heterogeneity. However, it does not address the nature of this issue, and this experimental-setup adjustment has a similar essence to those federated optimization methods (Li et al., 2018; Karimireddy et al., 2020). The results help us better understand the effects of the proposed SFAT on combating the intensified heterogeneity via relaxing the inner maximization to a lower bound. Figure 8: Comparison between FAT and SFAT with different local training epochs — (a) 10, (b) 5, and (c) 1 local training epochs. All the experiments are conducted on the CIFAR-10 dataset (Non-IID) with 5 clients and use PGD-20 (Madry et al., 2018) to evaluate the robust accuracy. Table 9: Comparison using FedAvg on CIFAR-10 and SVHN with different local epochs (Natural / FGSM / PGD-20 / CW) — CIFAR-10, 10 epochs (in Table 3): FAT 57.45% / 39.44% / 32.58% / 30.52%, SFAT 63.44% / 45.13% / 37.17% / 33.99%; CIFAR-10, 1 epoch: FAT 57.61% / 40.27% / 33.18% / 32.16%, SFAT 64.85% / 45.55% / 37.04% / 34.84%; SVHN, 2 epochs (in Table 3): FAT 91.24% / 87.95% / 68.87% / 67.39%, SFAT 91.25% / 88.28% / 71.72% / 69.79%; SVHN, 1 epoch: FAT 91.21% / 87.99% / 69.35% / 68.28%, SFAT 91.98% / 88.99% / 72.34% / 71.05%. In Table 9, we report the results under 1-epoch local training and compare them with our previous results under the multiple epochs used in Table 3. The results verify the consistent effectiveness of SFAT. Reducing the local epochs to 1 can indeed improve the performance of the baseline FAT (by about 0.5%), but it might not be very practical due to the resulting large communication cost in federated learning (McMahan et al., 2017). Even in the 1-epoch setting, as shown in Figure 8, FAT still suffers from robustness deterioration. The results in Table 9 demonstrate that the intensified heterogeneity can be better alleviated by our SFAT. This is similar to our discussion (in Appendix E.8) about the orthogonal effects with federated optimization methods (Li et al., 2018). E.3 SFAT VS. RE-SFAT Table 10: Comparison when emphasizing/de-emphasizing the client with the smallest loss.
Setting: Non-IID (Natural / FGSM / PGD-20 / CW). CIFAR-10 — SFAT, (1+α)/(1−α) = 1.4, emphasize: 63.44% / 45.13% / 36.17% / 33.99%; SFAT, (1+α)/(1−α) = 1.2, emphasize: 62.26% / 44.08% / 35.83% / 33.31%; FAT, (1+α)/(1−α) = 1.0, original: 57.45% / 39.44% / 32.58% / 30.52%; Re-SFAT, (1+α)/(1−α) = 0.8, de-emphasize: 50.45% / 34.34% / 27.86% / 26.62%; Re-SFAT, (1+α)/(1−α) = 0.6, de-emphasize: 40.47% / 28.81% / 24.36% / 23.19%. SVHN — SFAT, (1+α)/(1−α) = 1.4, emphasize: 90.60% / 87.75% / 73.12% / 70.51%; SFAT, (1+α)/(1−α) = 1.2, emphasize: 91.25% / 88.28% / 71.72% / 69.79%; FAT, (1+α)/(1−α) = 1.0, original: 91.24% / 87.95% / 68.87% / 67.89%; Re-SFAT, (1+α)/(1−α) = 0.8, de-emphasize: 90.03% / 86.12% / 64.35% / 64.32%; Re-SFAT, (1+α)/(1−α) = 0.6, de-emphasize: 89.46% / 84.80% / 58.64% / 58.96%. Figure 9: The robust accuracy w.r.t. the client drift (Li et al., 2018) of Re-SFAT, FAT, and SFAT during training — (a) robust accuracy, (b) client drift, (c) zoomed-in client drift. It shows that Re-SFAT further enhances the intensified heterogeneity, as well as the robustness deterioration, by emphasizing the client with the larger adversarial loss, which reverses the operation of our SFAT. This verifies the rationality of SFAT in slacking the original objective. In this part, we start with experiments comparing the performance of using the α-slack mechanism in the opposite direction, i.e., Re-SFAT, with our original SFAT. Then we discuss the underlying reason as well as the relationship between the loss and the intensified heterogeneity. In Table 10, we conduct an empirical comparison between FAT, SFAT (which emphasizes the client model with the smallest adversarial training loss), and Re-SFAT (a contrary variant of SFAT that de-emphasizes the client model with the smallest adversarial training loss). Here Re-SFAT shares the same spirit as AFL (Mohri et al., 2019), which seeks to improve fairness and generalization through loss-maximization reweighting. The experimental setups are kept the same as in Table 3, using 5 clients. Both SFAT and Re-SFAT keep $\hat{K}$ = 1 in all trials. Through the results across the two benchmarked datasets (i.e., CIFAR-10 and SVHN), we find that de-emphasizing the client with the smallest adversarial loss (i.e., relatively emphasizing those with larger adversarial losses) consistently harms the model performance across these evaluations, which shows that the loss-maximization spirit of Re-SFAT and AFL points in the wrong direction here. In contrast, emphasizing the client with the smallest adversarial loss indeed improves the model performance in terms of both natural and robust accuracy. This confirms the rationality of SFAT in alleviating the intensified heterogeneity by relaxing the inner maximization of adversarial generation during aggregation. Note that the generalization studied by AFL is under standard federated learning instead of federated adversarial training, and the empirical results also confirm that loss-maximization actually triggers the intensified heterogeneity and leads to lower accuracy (in Table 10). Since this is out of the scope of this work on handling the intensified heterogeneity, we leave it to future work. With respect to the experiments with standard training (in the left of Figure 4), there is no inner maximization in Eq. (3). Thus, there is no intensified heterogeneity to handle but only positive training signals from the standard training.
In this case, adding the inequality with slack in Eq. (3) only induces an extra objective bias to standard federated learning. Correspondingly, as shown in the left-most panel of Figure 4, such an operation instead degrades the model performance. Discussion about the loss and the intensified heterogeneity. It is the learning dynamic of Eq. (3) that upweights the client models with smaller losses and downweights the client models with larger losses to slack the overall objective. Possibly, the optimization bias (or, more rigorously, the client drift) may result in a smaller loss, and sometimes a smaller loss may not absolutely indicate a smaller optimization bias. We have verified the case where the opposite choice is made and the larger loss is preferred in the selection, in the middle panel of Figure 4 (see Re-SFAT vs. SFAT) and in Table 10 above (see Re-SFAT vs. SFAT). Re-SFAT actually constructs an upper bound of the FAT objective, different from SFAT, which relaxes it to a lower bound. The results empirically verify the failure of the case that prefers the larger loss. Here we discuss the connection between the slack mechanism and the intensified heterogeneity (or, roughly, the intensified optimization bias). To alleviate the intensified heterogeneity, our slack mechanism naturally relaxes the overall objective and constructs a mediating function that asymptotically approaches the original goal while reducing the negative impact of the intensified heterogeneity on training. Our analysis and comprehensive evidence from both theoretical (e.g., Theorems 4.2 and 4.3) and empirical (e.g., Figures 3, 10, and 11) views demonstrate its rationality. In Tables 3 and 19, the multiple experimental results with random non-IID data distributions as well as real-world datasets demonstrate the empirical superiority of our proposed SFAT over the original FAT. E.4 SFAT CORRESPONDING TO CLIENT DRIFT Figure 10: The client drift corresponding to the robust accuracy of FAT and our SFAT during training — (a) client drift, (b) robust accuracy. In this part, we present more results about the client drift during training using FAT and our SFAT than in Figure 3 [a hedged sketch of the drift metric appears after the table]. The corresponding curves during training show more clearly how SFAT alleviates the intensified heterogeneity and achieves higher robust accuracy. We conduct more experiments on the client drift of FAT and our SFAT on CIFAR-10 (Non-IID) in Figure 10. As α increases, our SFAT further alleviates the intensified heterogeneity of FAT at the later stage (showing smaller client drift values in Figure 10(a)), which corresponds to the robustness deterioration (exhibiting less obvious deterioration while achieving higher robust accuracy, as shown in Figure 10(b)). Unlike FAT, which is hindered by the intensified heterogeneity, our SFAT improves the robust accuracy by combating this issue. E.5 ADDITIONAL DISCUSSION ON THE α-SLACK MECHANISM Here, we first empirically verify the difference between FedSoftBetter and our SFAT by conducting experiments on CIFAR-10 and SVHN to compare their performance in the following table. The results show that its empirical performance is slightly better than FedAvg but not better than SFAT.
Table 11: Comparison of different methods on CIFAR-10 and SVHN (Natural / FGSM / PGD-20 / CW) — CIFAR-10: FAT 57.45% / 39.44% / 32.58% / 30.52%; FedSoftBetter 58.86% / 40.23% / 32.78% / 30.76%; SFAT 63.44% / 45.13% / 37.17% / 33.99%. SVHN: FAT 91.24% / 87.95% / 68.87% / 67.39%; FedSoftBetter 91.64% / 88.53% / 69.07% / 67.83%; SFAT 91.25% / 88.28% / 71.72% / 69.79%. In the following, we present experiments that progressively anneal the coefficient towards vanilla federated adversarial training during training, compared with SFAT on CIFAR-10 and SVHN. According to the results in Table 12, the gradually annealed-α SFAT achieves slightly lower robust accuracy than the constant-α SFAT and sometimes introduces a large drop in natural accuracy (e.g., on CIFAR-10). This indicates that keeping a constant α for the α-slack may be the better choice. On the other hand, it also confirms the intuition that a gradually decreased α runs empirically contrary to the observation of robustness deterioration (or the heterogeneity exacerbation shown in Figure 10) at the later stage of training. Table 12: Comparison on CIFAR-10 and SVHN using different α schedules (Natural / FGSM / PGD-20 / CW) — CIFAR-10: FAT 57.45% / 39.44% / 32.58% / 30.52%; SFAT with (1+α)/(1−α) = 1.4: 63.44% / 45.13% / 37.17% / 33.99%; SFAT with (1+α)/(1−α) annealed 1.4 → 1.0: 60.63% / 43.53% / 36.23% / 33.22%. SVHN: FAT 91.24% / 87.95% / 68.87% / 67.39%; SFAT with (1+α)/(1−α) = 1.2: 91.25% / 88.28% / 71.72% / 69.79%; SFAT with (1+α)/(1−α) annealed 1.2 → 1.0: 91.68% / 88.55% / 71.44% / 69.76%. E.6 SFAT CORRESPONDING TO GRADIENT VARIANCE Figure 11: The gradient variance corresponding to the robust accuracy of FAT and our SFAT — (a) gradient variance, (b) robust accuracy. To further factorize how the inner maximization affects the intensified heterogeneity of the local data, we add experiments on the variance of the clients' gradients (calculated from the parameter differences). We find a trend similar to the empirical results on client drift in Figure 11, i.e., SFAT prevents the exacerbated gradient variance seen in FAT and results in better robustness. E.7 EMPIRICAL VERIFICATION OF OUR THEORETICAL ANALYSIS Figure 12: Empirical verification of our theoretical analysis. Left panel: empirical estimation of the expectation term bounded in Theorem 4.2 for the original and our slacked objective on the CIFAR-10 dataset. Right panel: empirical estimation of the expectation term bounded in Theorem 4.3 for FedAvg and our SFAT on the SVHN dataset. The overall results confirm the benefit to convergence from our proposed α-slack mechanism. Here, we provide the empirical verification of the theoretical claims in Theorems 4.2 and 4.3 by tracing the training loss on the CIFAR-10 and SVHN datasets. In Figure 12, we present the empirical estimation of the RHS of Eq. (2) and Eq. (4) via tracing the training loss. The results show that our α-slack mechanism achieves faster convergence compared with the original objective, for the convergence of adversarial training in Theorem 4.2 and the federated case in Theorem 4.3. E.8 EXPERIMENTS ON THE ORTHOGONAL EFFECTS OF SFAT ON CLIENT DRIFT
In this part, we start with a discussion of the orthogonal effects of SFAT from the conceptual perspective, and then present the orthogonal effects of SFAT on combating the intensified heterogeneity by tracing the client drift from the experimental perspective. From the problem view, our slack mechanism targets the intensified heterogeneity. In contrast, like other techniques in the federated learning literature, FedProx (Li et al., 2018) is designed for the original heterogeneous data without considering the intensification process; the two problems are discussed in our Appendix A. Although the intensified heterogeneity and the ordinary heterogeneity (i.e., without the inner maximization) both induce client drift, their effects and extents differ. To be specific, the intensified heterogeneity results in more diverged models and exacerbates the differences among client models compared with the ordinary heterogeneity. This intensification process is not considered in FedProx or other related work (McMahan et al., 2017; Karimireddy et al., 2020); SFAT is orthogonal to them and further incorporates a slack mechanism to avoid its influence. Table 13: Client drift w.r.t. epochs on CIFAR-10 in Figure 3 (epochs 10/20/30/40/50/60/70/80/90/100) — FedAvg: FAT 11.56, 13.10, 15.04, 17.30, 19.22, 20.66, 21.72, 22.47, 23.02, 23.49; SFAT 12.21, 13.08, 14.37, 15.95, 17.68, 19.04, 20.16, 20.92, 21.50, 21.86. FedProx: FAT 4.71, 4.58, 5.98, 6.54, 7.83, 8.26, 9.12, 10.23, 11.71, 12.61; SFAT 5.89, 5.27, 5.67, 6.02, 6.87, 7.34, 7.98, 8.77, 9.65, 10.81. From the experimental view, we also provide a comparison between SFAT and FedProx (they are in fact orthogonal and combinable). We show the client drift w.r.t. the different methods in Tables 13 and 14. According to the results, FAT under the FedProx backbone can indeed catch up with our SFAT on the basis of FedAvg, since FedProx is designed to reduce client drift. However, when adopting FedProx as the backbone of SFAT, the intensified client drift is further reduced, which verifies the orthogonal effects (as stated in our Appendix A) on reducing the intensified heterogeneity, the critical issue of federated adversarial training. Note that we do not focus on the relationship between the normal heterogeneity (Li et al., 2018) (e.g., the value of client drift) and the robust performance in federated adversarial training. Instead, our proposed SFAT focuses on the robustness deterioration caused by the intensified heterogeneity (e.g., the increasing trend of client drift). It is orthogonal and compatible to those previous federated optimization methods. Table 14: Client drift w.r.t. epochs on CIFAR-10 in the setting of unequal splits with 5 clients (epochs 10/20/30/40/50/60/70/80/90/100) — FedAvg: FAT 13.96, 16.91, 19.89, 22.52, 24.40, 25.75, 26.76, 27.64, 28.13, 28.53; SFAT 14.15, 16.71, 19.37, 21.68, 23.33, 24.24, 25.39, 26.11, 26.75, 26.93. FedProx: FAT 5.36, 5.89, 6.82, 7.96, 9.36, 10.89, 12.52, 14.03, 15.41, 16.47; SFAT 6.28, 6.27, 6.59, 7.10, 7.68, 8.10, 9.06, 9.80, 10.60, 11.14. Table 15: Comparison of SFAT and FAT using FedProx with different parameter µ.
FedProx with µ = 0.01 (Natural / PGD-20 / CW): FAT 90.92% / 68.44% / 67.18%, SFAT 91.25% / 71.54% / 69.53%; µ = 0.05: FAT 90.25% / 68.07% / 66.88%, SFAT 91.37% / 70.58% / 68.93%; µ = 0.1: FAT 89.98% / 67.15% / 65.94%, SFAT 90.95% / 71.89% / 69.98%. Beyond the previous results, we further strengthen the hyper-parameter µ of the proximal term, i.e., $\tfrac{\mu}{2}\lVert w - w_t\rVert^2$, in FedProx to verify the improvement of using our SFAT on the SVHN dataset [a hedged sketch of this proximal term appears after the table]. We summarize the results in Table 15. The results show that when increasing µ from 0.01 (adopted in our experiments, following the recommendation of FedProx) to 0.1, the robust performance becomes even worse, while our SFAT can still reach better performance. The reason may be that a too-large µ also has a potentially negative influence on the convergence of training by forcing the updates to stay close to the starting point, as discussed in previous literature (Karimireddy et al., 2020). E.9 EXPERIMENTS WITH MORE CLIENTS Table 16: Comparison of FAT with SFAT on the Non-IID data partition with different client numbers (Natural / PGD-20 / CW) — 10 clients: CIFAR-10 FAT 56.62% / 31.24% / 29.82%, SFAT 56.67% / 33.31% / 31.58%; SVHN FAT 91.42% / 69.65% / 68.52%, SFAT 91.84% / 72.59% / 70.71%; CIFAR-100 FAT 33.27% / 16.81% / 14.12%, SFAT 34.17% / 17.66% / 14.25%. 20 clients: CIFAR-10 FAT 60.55% / 32.67% / 31.07%, SFAT 62.24% / 35.66% / 33.21%; SVHN FAT 92.14% / 70.32% / 69.48%, SFAT 92.75% / 72.06% / 71.14%; CIFAR-100 FAT 31.49% / 15.35% / 13.18%, SFAT 34.04% / 16.05% / 13.70%. 25 clients: CIFAR-10 FAT 58.97% / 32.98% / 31.14%, SFAT 62.73% / 35.75% / 33.16%; SVHN FAT 92.32% / 70.54% / 69.84%, SFAT 92.33% / 71.99% / 71.06%; CIFAR-100 FAT 32.64% / 15.82% / 13.23%, SFAT 34.19% / 16.37% / 13.63%. 50 clients: CIFAR-10 FAT 56.74% / 32.91% / 30.50%, SFAT 57.21% / 34.35% / 31.75%; SVHN FAT 91.97% / 70.84% / 69.42%, SFAT 91.99% / 71.87% / 70.74%; CIFAR-100 FAT 34.46% / 15.97% / 13.59%, SFAT 34.82% / 16.34% / 13.93%. In this part, we verify SFAT using more clients under the Non-IID data partition on the three benchmarked datasets, i.e., CIFAR-10, SVHN, and CIFAR-100. In Table 16, we vary the client number from 10 to 50 to investigate the scalability of our SFAT. For each client setting, we run FAT and SFAT to compare their performance on both natural and adversarial test data. In these experiments, we set $\hat{K}$ = K/5 and α = 1/11 for our SFAT and keep the other basic setups the same as in previous experiments. The results further confirm the effectiveness of SFAT in improving both natural and robust performance when training with different client numbers. E.10 EXPERIMENTS ON UNEQUAL DATA SPLITS Table 17: Performance on the setting with unequal data splits among clients (Non-IID; Natural / FGSM / PGD-20 / CW) — CIFAR-10, 5 clients, 6000–13000 samples per client: FedAvg FAT 59.98% / 40.57% / 31.50% / 29.57%, SFAT 61.70% / 42.81% / 33.87% / 30.99%; FedProx FAT 60.36% / 40.36% / 31.90% / 29.00%, SFAT 60.65% / 42.38% / 35.16% / 30.93%. CIFAR-10, 10 clients, 1000–8000 samples: FedAvg FAT 61.67% / 42.69% / 33.17% / 30.58%, SFAT 62.57% / 44.85% / 36.42% / 32.65%; FedProx FAT 60.24% / 41.25% / 33.21% / 30.98%, SFAT 60.78% / 43.73% / 36.76% / 32.44%. SVHN, 5 clients, 5860–26370 samples: FedAvg FAT 89.42% / 85.93% / 68.35% / 67.03%, SFAT 90.57% / 87.53% / 70.56% / 68.67%; FedProx FAT 90.15% / 86.59% / 68.02% / 66.22%, SFAT 90.55% / 87.45% / 71.19% / 69.00%. SVHN, 10 clients, 1465–13185 samples: FedAvg FAT 91.84% / 88.80% / 70.66% / 68.90%, SFAT 91.55% / 88.91% / 72.29% / 70.30%; FedProx FAT 90.95% / 87.77% / 69.80% / 68.20%, SFAT 91.61% / 88.83% / 72.45% / 70.29%. Table 18: Performance on the setting with severely unequal splits among clients on the SVHN dataset (Non-IID; Natural / FGSM / PGD-20 / CW).
SVHN, 10 clients, 1465–13185 samples: FedAvg FAT 91.84% / 88.80% / 70.66% / 68.90%, SFAT 91.55% / 88.91% / 72.29% / 70.30%; FedProx FAT 90.95% / 87.77% / 69.80% / 68.20%, SFAT 91.61% / 88.83% / 72.45% / 70.29%. SVHN, 10 clients, 50–16700 samples: FedAvg FAT 93.14% / 90.23% / 72.09% / 71.01%, SFAT 92.78% / 90.07% / 73.85% / 72.08%; FedProx FAT 93.06% / 90.37% / 72.03% / 70.90%, SFAT 93.14% / 90.51% / 73.87% / 72.35%. To complete our experimental verification on unequal data splits, we conduct experiments on the CIFAR-10 and SVHN datasets with different client numbers and summarize the results in Table 17. In addition, we add Table 18 to explore more severely unequal splits across clients. In such setups, the sample numbers of the different clients are also a critical factor in the optimization: the more samples a client has, the larger the weight its local model receives in the aggregation phase. In comparison, that weighting follows the statistical sample proportion, while our α utilizes the clue of the local loss from adversarial training. According to the results, we can see that SFAT is approximately orthogonal to the sample-number effect and remains more effective than FAT. At the algorithm level, our algorithm (refer to Algorithm 1) and the framework (refer to Figure 6) take the amount of local data into consideration: to be specific, when we conduct the slack selection, all the adversarial training losses are normalized by the local data number. E.11 EXPERIMENTS ON REAL-WORLD SCENARIOS To verify the effectiveness of our SFAT in a more practical situation, we conduct experiments using the real-world dataset CelebA from the federated learning benchmark LEAF (Caldas et al., 2018), with hundreds (455) of clients, in Table 19. Table 19: Performance on Non-IID settings using the real-world dataset CelebA (Natural / FGSM / PGD-20 / CW) — FedAvg: FAT 57.62% / 42.20% / 22.20% / 21.67%, SFAT 58.50% / 43.44% / 24.14% / 23.52%; FedProx: FAT 57.70% / 41.85% / 22.29% / 21.50%, SFAT 58.50% / 43.08% / 24.14% / 23.61%. We follow most of the settings in LEAF to perform our experiments using different federated optimization methods, and we set $\hat{K}$ = 2K/5 and α = 1/6 for all the experiments of our SFAT. In Table 19, we confirm the effectiveness of our SFAT using the real-world large-scale dataset CelebA with 455 clients. Except for Scaffold, which fails to converge in FAT and thus is not reported, our SFAT again gains significantly better natural and robust accuracy than FAT under FedAvg and FedProx. In addition, we also consider the practical situation where only a subset of clients participates in each round. Following the same settings as in the previous section, we add experiments on the SVHN dataset in Table 20 with 20 clients. Table 20: Test accuracy (%) on SVHN with different participation ratios (Non-IID; ratios 0.2 / 0.4 / 0.6 / 0.8 / 1.0) — Natural: FAT 91.93% / 91.39% / 92.19% / 92.31% / 92.14%, SFAT 91.61% / 92.51% / 92.30% / 92.53% / 92.75%; PGD-20: FAT 69.63% / 69.97% / 70.25% / 70.24% / 70.32%, SFAT 72.37% / 73.01% / 72.85% / 72.59% / 72.06%. The results show that a lower participation ratio leads to lower natural and PGD-20 accuracy, while our SFAT consistently outperforms FAT on robustness. E.12 OVERALL RESULTS ON BOTH NON-IID AND IID SETTINGS Here we provide the overall results for comparison on both Non-IID and IID settings in Table 21. For the Non-IID data, our SFAT gains consistent improvement across the various evaluation metrics and datasets.
For the IID data, our method acquires a similar improvement in robust accuracy without deterioration of the natural accuracy. The reason might be that even when the data is IID, adversarial training can still drive the independently initialized, over-parameterized networks (Allen-Zhu & Li, 2020) on the client side towards robust overfitting in different directions, yielding model heterogeneity. Thus, a proper slack of the inner maximization makes adversarial training more compatible with federated learning. Another interesting observation is that federated adversarial training shows better performance than centralized adversarial training in the IID setting. This gain could come from the distributed training paradigm, which helps adversarial training converge to a more robust optimum through a divide-and-conquer mechanism. This might motivate more exploration of improving adversarial robustness via federated learning. F FURTHER DISCUSSION Adversarial robustness is an important topic in centralized machine learning. Adversarial training is confirmed to be one of the most effective empirical defenses against adversarial attacks, which is critical especially in safety-critical areas like medicine and finance. In federated settings, how to train an adversarially robust model is a challenging but practical task given the increasing concern about data privacy. In this work, we observe and explore how to combat the intensified heterogeneity in federated adversarial training. Different from the conventional FAT adopted by previous works, we propose a new learning framework, i.e., SFAT, which relaxes the exacerbated heterogeneous effect and is compatible with various adversarial training (Madry et al., 2018; Zhang et al., 2019) and federated optimization methods (Li et al., 2018; Karimireddy et al., 2020). Although we take a step forward in FAT, it is not the end of this direction, since many problems remain to be addressed to further enhance the practicality of federated adversarial training. From the perspective of adversarial robustness, adversarial attacks can be very complex, especially in a decentralized environment, while the adversarial robustness discussed in this paper mainly focuses on common adversarial attacks (e.g., the L∞-bounded attack) (Goodfellow et al., 2015). More practical situations involving different kinds of adversarial attacks (e.g., mixed types of attacks combining the L∞-bounded attack and the spatially transformed attack (Xiao et al., 2018)), even considering only the inference phase, may also arise, since different clients may face different threats (Kairouz et al., 2019; Yao et al., 2022). Besides, federated adversarial training, which requires multiple local runs (Madry et al., 2018; Zhang et al., 2019), also introduces extra computation on low-capacity devices, which is a computational bottleneck and calls for lightweight techniques. Beyond the empirical defense strategy our work focuses on, certifiable robustness (Cohen et al., 2019; Zizzo et al., 2021; Alfarra et al., 2022), which can give theoretical guarantees, is also important. From the perspective of federated learning, our SFAT shares a similar spirit with conventional federated adversarial training.
The first set of challenges comes from the distributed learning paradigm (McMahan et al., 2017; Kairouz et al., 2019; Li et al., 2020) of the federated setting, which brings hardware constraints regarding the computational capacity of local clients and the communication cost between clients and server (Hong et al., 2021). Once these conditions are not satisfied, neither FAT nor SFAT would work well. The second comes from algorithm design for special training or inference issues, like dealing with heterogeneous data (Li et al., 2018; Zhao et al., 2018), class-imbalanced data, or even out-of-distribution data at inference time. On the other hand, the decentralized structure of the learning paradigm also introduces various issues in information transfer between server and clients. Current federated adversarial training still needs substantial improvement considering the practical cases that may happen in the federated setting. The intensified heterogeneity can also be recognized as a dynamic heterogeneity issue in federated learning, which may result from the particular learning algorithm adopted in the distributed framework or from other data manipulation scenarios. More robustness issues under the federated framework, like robust distillation (Goldblum et al., 2020; Zhu et al., 2022), train-test distribution shift (Jiang & Lin, 2023), and out-of-distribution detection (Yu et al., 2023), can be further explored in the future. |
| Open Source Code | Yes | The code is publicly available at: https://github.com/ZFancy/SFAT. |
| Open Datasets | Yes | We conduct the experiments on three benchmark datasets, i.e., CIFAR-10, CIFAR-100 (Krizhevsky, 2009), SVHN (Netzer et al., 2011), as well as a real-world dataset CelebA (Caldas et al., 2018) for federated adversarial training. |
| Dataset Splits | No | The paper describes training and test data partitioning but does not explicitly provide details for a separate validation split. It states: "For the IID scenario, we just randomly and evenly distribute the samples to each client. For the Non-IID scenario, we follow McMahan et al. (2017); Shah et al. (2021) to partition the training data based on their labels. To be specific, a skew parameter s is utilized in the data partition introduced by Shah et al. (2021), which enables K clients to get a majority of the data samples from a subset of classes. ... In the test phase, we evaluate the model's standard performance using natural test data and its robust performance using adversarial test data...". |
| Hardware Specification | Yes | All the experiments are conducted for multiple times using NVIDIA Tesla V100-SXM2. |
| Software Dependencies | No | The paper mentions general software components like SGD optimizer but does not provide specific version numbers for key software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the local training batch size, we set 32 for CIFAR-10 and 128 for CIFAR-100 and SVHN. For the training schedule, SGD is adopted with 0.9 momentum for 100 communication rounds under 5 clients as in (Hong et al., 2021; Shah et al., 2021), with weight decay 0.0001. For adversarial training, we set the PGD configurations (Madry et al., 2018) per dataset. On CIFAR-10/CIFAR-100, we set the perturbation bound ϵ = 8/255 and the PGD step size 2/255. On SVHN, we set the perturbation bound ϵ = 4/255 and the PGD step size 1/255. The PGD generation for all datasets keeps the same step number of 10. |
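
The sketches below illustrate, in Python, the mechanisms quoted in the table above. They are editorial reconstructions under stated assumptions, not the authors' released implementation (see https://github.com/ZFancy/SFAT for that). First, the label-skew Non-IID partition: each client keeps $(100 - (K-1)\cdot s)\%$ of the samples of its own class subset $Y_k$ and receives $s\%$ of every other subset. The function name and seeding are our own.

```python
import numpy as np

def skewed_partition(labels, num_clients, s=2.0, seed=0):
    """Label-skewed Non-IID partition in the spirit of Shah et al. (2021):
    classes are split evenly into groups Y_k; client k keeps
    (100 - (K-1)*s)% of each of its own classes and gets s% of the rest."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    class_groups = np.array_split(np.unique(labels), num_clients)  # Y_k
    client_indices = [[] for _ in range(num_clients)]
    for k, group in enumerate(class_groups):
        for c in group:
            idx = rng.permutation(np.where(labels == c)[0])
            share = int(len(idx) * s / 100)  # s% of this class per other client
            cursor = 0
            for j in range(num_clients):
                if j == k:
                    continue
                client_indices[j].extend(idx[cursor:cursor + share])
                cursor += share
            client_indices[k].extend(idx[cursor:])  # remaining majority stays
    return [np.asarray(ci) for ci in client_indices]
```

With s = 2 and K = 5, each client retains 92% of its own classes, matching the $(100 - (K-1)\cdot s)\%$ formula quoted above.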
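Next, the α-slack aggregation weighting. For $\hat{K} = 1$, the reconstructed $P_k$ formula reduces to boosting the client whose size-normalized adversarial loss $\tfrac{N_k}{N}L_k$ is smallest by $\tfrac{1+\alpha}{1-\alpha}$ and renormalizing; this minimal sketch assumes that reading, and the paper's exact normalization for $\hat{K} > 1$ may differ in detail.

```python
import numpy as np

def sfat_aggregation_weights(client_losses, client_sizes, alpha=1/6, k_hat=1):
    """Upweight the k_hat clients with the smallest size-normalized
    adversarial training losses by (1 + alpha) / (1 - alpha), then
    normalize the weights to sum to one (hedged reading of SFAT)."""
    losses = np.asarray(client_losses, dtype=float)
    sizes = np.asarray(client_sizes, dtype=float)
    norm_losses = losses * sizes / sizes.sum()     # (N_k / N) * L_k
    threshold = np.sort(norm_losses)[k_hat - 1]    # L_sorted[k_hat]
    boost = (1 + alpha) / (1 - alpha)              # e.g. 1.4 for alpha = 1/6
    raw = np.where(norm_losses <= threshold, boost, 1.0)
    return raw / raw.sum()

# Five equal-sized clients; the second has the smallest loss and gets
# 1.4 / 5.4 ~= 0.26 while the others get 1 / 5.4 ~= 0.19 each.
weights = sfat_aggregation_weights([0.9, 0.4, 1.1, 0.8, 1.0], [1] * 5)
```

The annealed variant compared in Table 12 would simply decay alpha towards 0 over communication rounds before calling this helper.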
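The PGD configuration quoted in the training details (random start in $[-\epsilon, \epsilon]$, 10 steps; $\epsilon = 8/255$ with step 2/255 on CIFAR-10/100, $\epsilon = 4/255$ with step 1/255 on SVHN) is standard $L_\infty$ PGD (Madry et al., 2018). A common PyTorch sketch assuming inputs scaled to [0, 1], not the authors' exact code:

```python
import torch

def pgd_attack(model, x, y, eps=8/255, step_size=2/255, steps=10):
    """L-inf PGD with a uniform random start in [-eps, eps]."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()        # ascend the loss
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)  # project to ball
    return x_adv.detach()
```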
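Putting these together, one communication round of the SFAT framework as described (local adversarial training, then loss-aware slack aggregation) might look like the sketch below. It reuses `pgd_attack` and `sfat_aggregation_weights` from above, and omits details such as learning-rate schedules, partial client participation, and batch-norm buffer averaging; it is illustrative, not the authors' algorithm verbatim.

```python
import copy
import torch
import torch.nn.functional as F

def sfat_round(global_model, client_loaders, alpha=1/6, k_hat=1,
               local_epochs=2, lr=0.01):
    """One hypothetical SFAT communication round: each client runs PGD-based
    adversarial training locally; the server averages parameters with the
    alpha-slack weights computed from per-sample local training losses."""
    client_models, losses, sizes = [], [], []
    for loader in client_loaders:
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr,
                              momentum=0.9, weight_decay=1e-4)
        total_loss, seen = 0.0, 0
        for _ in range(local_epochs):
            for x, y in loader:
                x_adv = pgd_attack(model, x, y)   # inner maximization
                loss = F.cross_entropy(model(x_adv), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
                total_loss += loss.item() * len(y)
                seen += len(y)
        client_models.append(model)
        losses.append(total_loss / seen)          # per-sample local loss L_k
        sizes.append(seen // local_epochs)        # local data count N_k
    w = sfat_aggregation_weights(losses, sizes, alpha, k_hat)
    with torch.no_grad():                         # weighted parameter averaging
        for name, p in global_model.named_parameters():
            p.copy_(sum(wk * dict(m.named_parameters())[name]
                        for wk, m in zip(w, client_models)))
    return global_model
```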
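The FedProx proximal term $(\mu/2)\,\lVert w - w_t\rVert^2$ varied in Table 15 penalizes local drift from the round's starting global weights $w_t$; the paper reports µ = 0.01 as its default. A hedged helper (the name is ours):

```python
import torch

def fedprox_local_loss(model, global_params, task_loss, mu=0.01):
    """Add the FedProx proximal penalty mu/2 * ||w - w_t||^2 to a local
    task loss; `global_params` snapshots w_t at the start of the round."""
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    return task_loss + 0.5 * mu * prox
```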
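Finally, the diagnostics traced in Appendices E.4 and E.6 — client drift and the gradient variance "calculated from the parameter differences" — can be approximated as below. This is our hedged reading of those metrics; the paper's precise definitions (following Li et al., 2018) may differ.

```python
import torch

@torch.no_grad()
def drift_and_variance(global_model, client_models):
    """Treat each client's parameter difference from the global model as a
    pseudo-gradient; report the mean distance (a drift proxy) and the
    coordinate-wise variance across clients (a gradient-variance proxy)."""
    g = torch.nn.utils.parameters_to_vector(global_model.parameters())
    deltas = torch.stack([
        torch.nn.utils.parameters_to_vector(m.parameters()) - g
        for m in client_models])
    drift = deltas.norm(dim=1).mean().item()
    variance = deltas.var(dim=0).sum().item()
    return drift, variance
```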