Collaborative Linear Bandits with Adversarial Agents: Near-Optimal Regret Bounds
Authors: Aritra Mitra, Arman Adibi, George J. Pappas, Hamed Hassani
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report simulation results on synthetic data to corroborate our developed theory. Additional simulations on contextual bandits and alternate attack models are presented in Appendix H. Experimental Setup. We consider a setting with 50 actions in R^5; we describe how these actions and θ∗ are generated in Appendix H. The rewards are generated based on the observation model in Eq. (1). We now describe the attack model. To manipulate the server into selecting sub-optimal arms, each adversarial agent i employs the simple strategy of reducing the rewards of the good arms and increasing the rewards of the bad arms. More precisely, in each epoch ℓ, upon pulling an arm a and observing the corresponding reward y_a, an adversarial agent i does the following: if y_a > p⟨θ∗, a⟩, then this reward is corrupted to ỹ_a = y_a − β; and if y_a ≤ p⟨θ∗, a⟩, then the reward is corrupted to ỹ_a = y_a + β. For this experiment, we fix p = 0.6 and β = 5. Agent i ∈ B then uses all the corrupted rewards in epoch ℓ to generate the local model estimate θ̂_i^(ℓ) that is transmitted to the server. Discussion of Simulation Results. Fig. 1 summarizes our experimental results. In Fig. 1(a), we compare our proposed algorithm RCLB to a vanilla distributed phased elimination (PE) algorithm that does not account for adversarial agents. Specifically, the latter is designed by replacing the median operation in line 8 of Algorithm 1 with a mean operation, and setting the threshold γ_ℓ in line 9 to be ϵ_ℓ. Fig. 1(a) shows that even a small fraction α = 0.1 of adversaries can cause the non-robust PE algorithm to incur linear regret. In contrast, RCLB continues to guarantee sub-linear regret despite adversarial corruptions. Furthermore, the regret of RCLB in the presence of a small fraction of adversarial agents is close to that of the non-robust algorithm in the absence of adversaries. This further establishes the robustness of RCLB. (A hedged sketch of this median-vs-mean aggregation contrast appears after the table.) |
| Researcher Affiliation | Academia | Aritra Mitra, Arman Adibi, George J. Pappas, Hamed Hassani. Department of Electrical and Systems Engineering, University of Pennsylvania. {amitra20, aadibi, hassani, pappasg}@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1: Robust Collaborative Phased Elimination for Linear Bandits (RCLB); Algorithm 2: Robust Base LinUCB (at Server); Algorithm 3: Robust Collaborative SupLinUCB for Contextual Bandits (at Server) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper uses "synthetic data" generated for the experiments and describes the process of generating this data in Appendix H, but it does not specify that this data is publicly available or provide concrete access information (link, DOI, citation for an established dataset). |
| Dataset Splits | No | The paper discusses an experimental setup using synthetic data but does not specify dataset splits (e.g., training, validation, test percentages or counts) or cross-validation setup. |
| Hardware Specification | No | The paper describes the experimental setup but does not provide any specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions simulation results but does not specify any software dependencies with version numbers (e.g., Python, specific libraries, or solvers). |
| Experiment Setup | Yes | Experimental Setup. We consider a setting with 50 actions in R^5; we describe how these actions and θ∗ are generated in Appendix H. The rewards are generated based on the observation model in Eq. (1). We now describe the attack model. To manipulate the server into selecting sub-optimal arms, each adversarial agent i employs the simple strategy of reducing the rewards of the good arms and increasing the rewards of the bad arms. More precisely, in each epoch ℓ, upon pulling an arm a and observing the corresponding reward y_a, an adversarial agent i does the following: if y_a > p⟨θ∗, a⟩, then this reward is corrupted to ỹ_a = y_a − β; and if y_a ≤ p⟨θ∗, a⟩, then the reward is corrupted to ỹ_a = y_a + β. For this experiment, we fix p = 0.6 and β = 5. Agent i ∈ B then uses all the corrupted rewards in epoch ℓ to generate the local model estimate θ̂_i^(ℓ) that is transmitted to the server. (A hedged sketch of this corruption rule appears after the table.) |
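
The attack model quoted in the Research Type and Experiment Setup rows reduces to a simple threshold rule on each observed reward. Below is a minimal Python sketch of that rule, assuming rewards follow a linear observation model of the form y_a = ⟨θ∗, a⟩ + noise; the function name `corrupt_reward`, the noise model, and the random values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def corrupt_reward(y_a, theta_star, a, p=0.6, beta=5.0):
    """Attack rule from the quoted experimental setup: push rewards of good
    arms down and rewards of bad arms up.

    If y_a > p * <theta_star, a>, report y_a - beta; otherwise report
    y_a + beta.  The experiments fix p = 0.6 and beta = 5.
    """
    threshold = p * float(np.dot(theta_star, a))
    return y_a - beta if y_a > threshold else y_a + beta


# Illustrative usage with synthetic values (not the paper's data).
rng = np.random.default_rng(0)
theta_star = rng.normal(size=5)                       # unknown parameter in R^5
a = rng.normal(size=5)                                # one of the 50 actions
y_a = float(np.dot(theta_star, a)) + rng.normal()     # Eq. (1)-style noisy reward
y_tilde = corrupt_reward(y_a, theta_star, a)          # reward reported by an agent in B
```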
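
The Fig. 1(a) comparison hinges on the server's aggregation step: RCLB uses a median over agents, while the non-robust PE baseline uses a mean together with the smaller threshold ϵ_ℓ. The sketch below illustrates that contrast under the assumption that the server scores each active arm by aggregating the per-agent reward estimates ⟨θ̂_i^(ℓ), a⟩; the exact quantity aggregated in line 8 of Algorithm 1 and the value of γ_ℓ are not reproduced in the excerpts above, so this is a sketch of the idea rather than the authors' algorithm.

```python
import numpy as np

def eliminate_arms(theta_hats, actions, threshold, robust=True):
    """Sketch of a phased-elimination step at the server.

    theta_hats : (M, d) array of local model estimates, one row per agent
                 (adversarial agents may have corrupted theirs).
    actions    : (K, d) array of currently active arms.
    threshold  : gamma_l for the robust variant, eps_l for the baseline.
    robust     : True  -> median over agents (RCLB-style aggregation),
                 False -> mean over agents (non-robust PE baseline).
    """
    per_agent_scores = actions @ theta_hats.T          # (K, M) estimated rewards
    agg = np.median if robust else np.mean
    scores = agg(per_agent_scores, axis=1)             # aggregate across agents
    keep = scores >= scores.max() - threshold          # retain near-optimal arms
    return actions[keep]
```

With a small fraction of corrupted rows in `theta_hats`, the median-based scores stay close to the clean estimates, whereas the mean can be dragged arbitrarily far by a single adversarial agent, which is the qualitative behavior reported for Fig. 1(a).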