Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions

Authors: Michael Chang, Sid Kaushik, S. Matthew Weinberg, Tom Griffiths, Sergey Levine

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | At the fourth level, we empirically investigate various implementations of the cloned Vickrey society under our decentralized reinforcement learning algorithm and find that a particular set of design choices, which we call the credit conserving Vickrey implementation, yields the best performance at both the societal and agent level. Lastly, we demonstrate the potential advantages of a society's inherent modular structure for more efficient transfer learning. We study how well the cloned Vickrey society can recover the optimal societal Q-function as its Nash equilibrium. We compare several implementations of the cloned Vickrey society against baselines across simple tabular environments in Section 7.1, where the transformations φT are literal actions. Then in Section 7.2 we demonstrate the broad applicability of the cloned Vickrey society for learning to select options in semi-MDPs and composing functions in a computation graph. (An illustrative second-price auction sketch follows this table.)
Researcher Affiliation | Academia | 1 Department of Computer Science, University of California, Berkeley, USA; 2 Department of Computer Science, Princeton University, USA. Correspondence to: Michael Chang <mbchang@berkeley.edu>.
Pseudocode | Yes | An on-policy learning algorithm is presented in Appendix D (Algorithm 1: Decentralized Reinforcement Learning). (A hypothetical skeleton of such a loop follows this table.)
Open Source Code | Yes | Code, talk, and blog here.
Open Datasets | Yes | We construct a two-room environment, Two Rooms, which requires opening the red door and reaching either a green goal or blue goal. We adapt the Image Transformations task from Chang et al. (2018) as what we call the Mental Rotation environment (Figure 9), in which MNIST images have been transformed with a composition of one of two rotations and one of four translations. (A sketch of such transformations follows this table.)
Dataset Splits | No | The paper describes the specific environments and tasks used for experiments (e.g., Market Bandit, Chain, Duality, Two Rooms, Mental Rotation) and how the agents interact within them. However, it does not specify explicit train/validation/test splits (with percentages or sample counts) for any underlying dataset, so the partitioning cannot be reproduced from the paper alone.
Hardware Specification | No | The acknowledgements section states 'This research was supported by... Google Cloud Platform.' However, it does not provide specific details about the hardware used on Google Cloud Platform, such as GPU models, CPU types, or memory configurations.
Software Dependencies | No | We used proximal policy optimization (PPO) (Schulman et al., 2017) to optimize the bidding policy parameters. The paper mentions PyTorch and PPO, but it does not specify version numbers for these or any other software libraries or dependencies used in the experiments.
Experiment Setup | No | The paper describes the bidding policy as a neural network that maps state to Beta distribution parameters and states that PPO was used for optimization. It also details the design of the experimental environments (Market Bandit, Chain, Duality, Two Rooms, Mental Rotation). However, it does not report PPO hyperparameters (e.g., learning rate, batch size, network architecture) or other training configurations required for direct reproduction. (A sketch of a Beta bidding policy with a PPO clipped loss follows this table.)
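
The Research Type row quotes the paper's cloned Vickrey society and its credit conserving Vickrey implementation. As background, here is a minimal sketch of the standard sealed-bid second-price (Vickrey) auction rule that such a society is built on; it illustrates the generic mechanism only, and the `vickrey_auction_step` helper is not taken from the paper's code.

```python
import numpy as np

def vickrey_auction_step(bids):
    """One sealed-bid second-price (Vickrey) auction among bidding agents.

    bids: non-negative bids, one per agent (or per clone of an agent).
    Returns the index of the highest bidder and the price it pays, which is
    the second-highest bid (the truthful, dominant-strategy price).
    """
    bids = np.asarray(bids, dtype=float)
    order = np.argsort(bids)            # ascending order of bids
    winner = int(order[-1])             # highest bidder wins the right to act
    price = float(bids[order[-2]])      # winner pays the second-highest bid
    return winner, price

# Example: three clones bid on the current state; clone 1 wins and pays 0.5.
winner, price = vickrey_auction_step([0.2, 0.7, 0.5])
```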
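
The Pseudocode row points to Algorithm 1 (Decentralized Reinforcement Learning) in Appendix D, which is not reproduced in this summary. The skeleton below only suggests how an on-policy loop built around per-state auctions might be organized; `env`, `policies`, `transformations`, and `ppo_update` are hypothetical placeholders, and the revenue routing of the credit conserving implementation is deliberately omitted.

```python
def train(env, policies, transformations, ppo_update, num_episodes=100):
    """Hypothetical on-policy loop around per-state Vickrey auctions (not Algorithm 1)."""
    for _ in range(num_episodes):
        rollouts = {i: [] for i in range(len(policies))}     # per-agent transitions
        state, done = env.reset(), False
        while not done:
            bids = [policy.bid(state) for policy in policies]
            winner, price = vickrey_auction_step(bids)        # sketch defined above
            next_state, reward, done = env.step(transformations[winner])
            # A full implementation would also route revenue between successive
            # auctions (e.g. the next winning bid) to define each agent's utility.
            rollouts[winner].append((state, bids[winner], price, reward))
            state = next_state
        for i, policy in enumerate(policies):
            if rollouts[i]:
                ppo_update(policy, rollouts[i])
```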
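
The Open Datasets row describes the Mental Rotation environment, in which MNIST digits are transformed by one of two rotations and one of four translations. The snippet below is a hypothetical reconstruction of such a data pipeline in torchvision; the specific angles and pixel offsets are placeholders, since the paper's main text does not list them.

```python
import random
import torchvision
import torchvision.transforms.functional as TF

# Assumed transformation sets; the exact angles and offsets are placeholders.
ROTATIONS = [90, -90]                               # degrees
TRANSLATIONS = [(4, 0), (-4, 0), (0, 4), (0, -4)]   # (dx, dy) pixel offsets

mnist = torchvision.datasets.MNIST(
    root="data", download=True,
    transform=torchvision.transforms.ToTensor(),
)

def mental_rotation_sample(img):
    """Apply one random rotation and one random translation to a single image."""
    img = TF.rotate(img, angle=float(random.choice(ROTATIONS)))
    dx, dy = random.choice(TRANSLATIONS)
    return TF.affine(img, angle=0.0, translate=[dx, dy], scale=1.0, shear=[0.0])

img, label = mnist[0]
transformed = mental_rotation_sample(img)   # the society learns to undo this composition
```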
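
The Experiment Setup row notes that the bidding policy is a neural network mapping state to the parameters of a Beta distribution and is trained with PPO, without reporting hyperparameters. The sketch below shows one plausible PyTorch realization of that description; the hidden size, the softplus parameterization, and the clipping coefficient are assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class BiddingPolicy(nn.Module):
    """Maps a state to a Beta distribution over bids in (0, 1). Sizes are assumed."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 2),            # raw (alpha, beta) parameters
        )

    def forward(self, state):
        alpha, beta = nn.functional.softplus(self.net(state)).unbind(-1)
        return Beta(alpha + 1e-4, beta + 1e-4)   # keep both concentrations positive

def ppo_clip_loss(policy, states, bids, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective for the bidding policy (sketch)."""
    dist = policy(states)
    ratio = torch.exp(dist.log_prob(bids) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

In use, bids would be sampled with `dist.sample()` and kept strictly inside (0, 1) before being fed to the auction, so that `log_prob` stays finite during the update.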