Federated Q-Learning: Linear Regret Speedup with Low Communication Cost
Authors: Zhong Zheng, Fengyu Gao, Lingzhou Xue, Jing Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments in a synthetic environment to validate the theoretical performances of Fed Q-Hoeffding, Fed Q-Bernstein, and compare with their single-user counterparts UCB-H and UCB-B (Jin et al., 2018), respectively. |
| Researcher Affiliation | Academia | Zhong Zheng, Fengyu Gao, Lingzhou Xue & Jing Yang The Pennsylvania State University {zvz5337,fzg5170,lzxue,yangjing}@psu.edu |
| Pseudocode | Yes | Algorithms 1 and 2 formally present the Hoeffding-type design. [...] Algorithm 1 Fed Q-Hoeffding (Central Server) [...] Algorithm 2 Fed Q-Hoeffding (Agent m in round k) |
| Open Source Code | Yes | Numerical experiments in this paper can be fully reproduced via the publicly available code: https://openreview.net/attachment?id=fe6ANBxcKM&name=supplementary_material |
| Open Datasets | No | In this section, we conduct experiments in a synthetic environment to validate the theoretical performances of Fed Q-Hoeffding, Fed Q-Bernstein, and compare with their single-user counterparts UCB-H and UCB-B (Jin et al., 2018), respectively. |
| Dataset Splits | No | The paper describes experiments in a 'synthetic environment' but does not specify any train/validation/test splits or their proportions, as data is generated through interaction in an RL setting. |
| Hardware Specification | No | The paper describes conducting 'numerical experiments' but does not specify any hardware details such as CPU, GPU models, or memory. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers (e.g., Python 3.x, PyTorch 1.x) that would allow for reproducible setup of software dependencies. |
| Experiment Setup | Yes | We set the number of states S to be 3, the number of actions A for each state to be 2, and the episode length H to be 5. The reward r_h(s, a) for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel P_h(· | s, a) from an S-dimensional simplex independently and uniformly at random for each state-action pair and each step. This procedure guarantees that the synthetic environment is a proper tabular MDP. Under the given MDP, we set M = 10 and T/H = 3 × 10^4 for Fed Q-Hoeffding and Fed Q-Bernstein, and T/H = 3 × 10^5, M = 1 for UCB-H and UCB-B. Thus, the total number of episodes is 3 × 10^5 for all four algorithms. We choose c = ι = 1 for all algorithms. |
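
The experiment setup above fully specifies how the synthetic tabular MDP is generated. The following is a minimal sketch of that generation step, assuming NumPy and illustrative function and variable names (this is not the authors' released code); sampling uniformly from the S-dimensional simplex is done via a Dirichlet(1, ..., 1) draw.

```python
# Hedged sketch of the synthetic tabular MDP described in the experiment setup.
# Function and variable names are illustrative assumptions, not the authors' code.
import numpy as np

def make_tabular_mdp(S=3, A=2, H=5, seed=0):
    """Sample a random tabular MDP: rewards r_h(s, a) ~ Uniform[0, 1] and
    transition kernels P_h(. | s, a) drawn uniformly from the S-simplex."""
    rng = np.random.default_rng(seed)
    # r[h, s, a] ~ Uniform[0, 1], independently for each step/state/action.
    rewards = rng.uniform(0.0, 1.0, size=(H, S, A))
    # P[h, s, a, :] uniform on the simplex, i.e. Dirichlet(1, ..., 1).
    transitions = rng.dirichlet(np.ones(S), size=(H, S, A))
    return rewards, transitions

rewards, transitions = make_tabular_mdp()
# Sanity check: every P_h(. | s, a) is a valid probability distribution.
assert np.allclose(transitions.sum(axis=-1), 1.0)
```

Under this construction, every state-action pair at every step has a strictly stochastic transition distribution and a bounded reward, which is what makes the environment a proper tabular MDP as stated in the setup.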