Federated Q-Learning: Linear Regret Speedup with Low Communication Cost

Authors: Zhong Zheng, Fengyu Gao, Lingzhou Xue, Jing Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "In this section, we conduct experiments in a synthetic environment to validate the theoretical performances of FedQ-Hoeffding and FedQ-Bernstein, and compare with their single-user counterparts UCB-H and UCB-B (Jin et al., 2018), respectively."

Researcher Affiliation | Academia | "Zhong Zheng, Fengyu Gao, Lingzhou Xue & Jing Yang, The Pennsylvania State University. {zvz5337,fzg5170,lzxue,yangjing}@psu.edu"

Pseudocode | Yes | "Algorithms 1 and 2 formally present the Hoeffding-type design. [...] Algorithm 1 FedQ-Hoeffding (Central Server) [...] Algorithm 2 FedQ-Hoeffding (Agent m in round k)"

Open Source Code | Yes | "Numerical experiments in this paper can be fully reproduced via the publicly available code" (https://openreview.net/attachment?id=fe6ANBxcKM&name=supplementary_material)

Open Datasets | No | The experiments use a synthetic environment generated by the authors rather than a public dataset: "In this section, we conduct experiments in a synthetic environment to validate the theoretical performances of FedQ-Hoeffding and FedQ-Bernstein, and compare with their single-user counterparts UCB-H and UCB-B (Jin et al., 2018), respectively."

Dataset Splits | No | The paper describes experiments in a "synthetic environment" but does not specify any train/validation/test splits or their proportions, as data is generated through interaction in an RL setting.

Hardware Specification | No | The paper describes conducting "numerical experiments" but does not specify hardware details such as CPU or GPU models or memory.

Software Dependencies | No | The paper does not provide specific software names with version numbers (e.g., Python 3.x, PyTorch 1.x) that would allow a reproducible setup of the software dependencies.

Experiment Setup | Yes | "We set the number of states S to be 3, the number of actions A for each state to be 2, and the episode length H to be 5. The reward rh(s, a) for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel Ph(· | s, a) from an S-dimensional simplex independently and uniformly at random for each state-action pair and each step. Such a procedure guarantees that the synthetic environment is a proper tabular MDP. Under the given MDP, we set M = 10 and T/H = 3 × 10^4 for FedQ-Hoeffding and FedQ-Bernstein, and T/H = 3 × 10^5, M = 1 for UCB-H and UCB-B. Thus, the total number of episodes is 3 × 10^5 for all four algorithms. We choose c = ι = 1 for all algorithms."
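The quoted setup can be sketched in a few lines. The following is a minimal illustration (not the authors' released code), assuming NumPy: it draws rewards uniformly from [0, 1], draws each transition kernel uniformly from the probability simplex via a symmetric Dirichlet(1, ..., 1), and checks that the federated episode budget (M = 10 agents, T/H = 3 × 10^4 episodes each) matches the single-agent budget of 3 × 10^5 episodes.

```python
import numpy as np

S, A, H = 3, 2, 5  # states, actions, episode length (values from the paper)
rng = np.random.default_rng(0)  # seed chosen here for reproducibility of the sketch

# Rewards r_h(s, a): i.i.d. Uniform[0, 1] for every step h and state-action pair.
rewards = rng.uniform(0.0, 1.0, size=(H, S, A))

# Transition kernels P_h(. | s, a): uniform on the S-dimensional simplex,
# which is exactly a symmetric Dirichlet(1, ..., 1) draw, independently
# for every (h, s, a). Resulting shape: (H, S, A, S).
transitions = rng.dirichlet(np.ones(S), size=(H, S, A))

# Sanity check: each P_h(. | s, a) is a valid distribution, so the
# construction yields a proper tabular MDP.
assert np.allclose(transitions.sum(axis=-1), 1.0)

# Episode-budget parity between the federated and single-agent runs.
M, episodes_per_agent = 10, 3 * 10**4      # FedQ-Hoeffding / FedQ-Bernstein
total_federated = M * episodes_per_agent
total_single = 3 * 10**5                   # UCB-H / UCB-B
assert total_federated == total_single == 300_000
```

The Dirichlet draw is the standard way to sample uniformly from a simplex, which matches the paper's description of generating transition kernels "independently and uniformly at random" per state-action pair and step.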