Federated Q-Learning: Linear Regret Speedup with Low Communication Cost
Authors: Zhong Zheng, Fengyu Gao, Lingzhou Xue, Jing Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments in a synthetic environment to validate the theoretical performances of Fed Q-Hoeffding, Fed Q-Bernstein, and compare with their single-user counterparts UCB-H and UCB-B (Jin et al., 2018), respectively. |
| Researcher Affiliation | Academia | Zhong Zheng, Fengyu Gao, Lingzhou Xue & Jing Yang The Pennsylvania State University {zvz5337,fzg5170,lzxue,yangjing}@psu.edu |
| Pseudocode | Yes | Algorithms 1 and 2 formally present the Hoeffding-type design. [...] Algorithm 1 Fed Q-Hoeffding (Central Server) [...] Algorithm 2 Fed Q-Hoeffding (Agent m in round k) |
| Open Source Code | Yes | Numerical experiments in this paper can be fully reproduced via the publicly available code: https://openreview.net/attachment?id=fe6ANBxcKM&name=supplementary_material |
| Open Datasets | No | In this section, we conduct experiments in a synthetic environment to validate the theoretical performances of Fed Q-Hoeffding, Fed Q-Bernstein, and compare with their single-user counterparts UCB-H and UCB-B (Jin et al., 2018), respectively. |
| Dataset Splits | No | The paper describes experiments in a 'synthetic environment' but does not specify any train/validation/test splits or their proportions, as data is generated through interaction in an RL setting. |
| Hardware Specification | No | The paper describes conducting 'numerical experiments' but does not specify any hardware details such as CPU, GPU models, or memory. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers (e.g., Python 3.x, PyTorch 1.x) that would allow for reproducible setup of software dependencies. |
| Experiment Setup | Yes | We set the number of states S to be 3, the number of actions A for each state to be 2, and the episode length H to be 5. The reward r_h(s, a) for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel P_h(· | s, a) from an S-dimensional simplex independently and uniformly at random for each state-action pair and each step. This procedure guarantees that the synthetic environment is a proper tabular MDP. Under the given MDP, we set M = 10 and T/H = 3 × 10^4 for Fed Q-Hoeffding and Fed Q-Bernstein, and T/H = 3 × 10^5, M = 1 for UCB-H and UCB-B. Thus, the total number of episodes is 3 × 10^5 for all four algorithms. We choose c = ι = 1 for all algorithms. |
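
The experiment setup above fully specifies how the synthetic tabular MDP is generated. The following is a minimal sketch of that generation step, assuming NumPy and illustrative function and variable names (this is not the authors' released code); sampling uniformly from the S-dimensional simplex is done via a Dirichlet(1, ..., 1) draw.

```python
# Hedged sketch of the synthetic tabular MDP described in the experiment setup.
# Function and variable names are illustrative assumptions, not the authors' code.
import numpy as np

def make_tabular_mdp(S=3, A=2, H=5, seed=0):
    """Sample a random tabular MDP: rewards r_h(s, a) ~ Uniform[0, 1] and
    transition kernels P_h(. | s, a) drawn uniformly from the S-simplex."""
    rng = np.random.default_rng(seed)
    # r[h, s, a] ~ Uniform[0, 1], independently for each step/state/action.
    rewards = rng.uniform(0.0, 1.0, size=(H, S, A))
    # P[h, s, a, :] uniform on the simplex, i.e. Dirichlet(1, ..., 1).
    transitions = rng.dirichlet(np.ones(S), size=(H, S, A))
    return rewards, transitions

rewards, transitions = make_tabular_mdp()
# Sanity check: every P_h(. | s, a) is a valid probability distribution.
assert np.allclose(transitions.sum(axis=-1), 1.0)
```

Under this construction, every state-action pair at every step has a strictly stochastic transition distribution and a bounded reward, which is what makes the environment a proper tabular MDP as stated in the setup.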