Maximum a Posteriori Policy Optimisation

Authors: Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings."
Researcher Affiliation | Industry | DeepMind, London, UK; {aabdolmaleki,springenberg,tassa,munos,heess,riedmiller}@google.com
Pseudocode | Yes | Algorithm 1: MPO (chief); Algorithm 2: MPO (worker), non-parametric variational distribution; Algorithm 3: MPO (worker), parametric variational distribution. (A sketch of the non-parametric E-step is given after the table.)
Open Source Code | No | The paper states: "This suite of tasks was built in python on top of mujoco and will also be open sourced to the public by the time of publication." This is a promise of future release, not concrete access at the time of publication.
Open Datasets | Yes | "Specifically, we start by looking at the continuous control tasks of the DeepMind Control Suite (Tassa et al. (2018), see Figure 1), and then consider the challenging parkour environments recently published in Heess et al. (2017). In addition, we present initial experiments for discrete control using ATARI environments using a categorical policy distribution (whose logits are again parameterized by a neural network) in the appendix." (A loading example for the Control Suite is given after the table.)
Dataset Splits | No | No specific mention of training/validation/test splits with percentages or sample counts. The paper mentions tuning hyperparameters, which implies a validation process, but it does not explicitly state how data were split for this purpose.
Hardware Specification | No | The paper mentions running experiments with a 'parallel variant' of the algorithm using 'distributed synchronous gradient descent', but it does not specify concrete hardware details such as CPU/GPU models, memory, or specific cloud instances.
Software Dependencies | No | The paper mentions that the 'Control Suite' was built in Python, but it does not provide version numbers for Python or for any other software libraries, frameworks, or dependencies used in the experiments.
Experiment Setup | Yes | "The hyperparameters for MPO were kept fixed for all experiments in the paper (see the appendix for hyperparameter settings). ... In this section we give the details on the hyper-parameters used for each experiment. All the continuous control experiments use a feed-forward network except for Parkour-2d where we used the same network architecture as in Heess et al. (2017)." The remaining hyperparameters are listed in Table 2 (parameters for the non-parametric variational distribution) and Table 3 (parameters for the parametric variational distribution). (An illustrative configuration skeleton is given after the table.)
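For readers checking the Pseudocode row, here is a minimal numerical sketch of the non-parametric E-step that Algorithm 2 refers to: actions sampled from the current policy are re-weighted by exp(Q/eta), with the temperature eta obtained by minimising a convex dual so that the induced distribution stays within a KL bound of the sampling policy. The function names, array shapes, and the use of scipy's optimiser are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def estep_weights(q_values, eps=0.1):
    """Re-weight K sampled actions per state (q_values has shape [batch, K]) so
    that the induced non-parametric distribution stays within an average KL
    bound `eps` of the sampling policy."""
    def dual(eta):
        # g(eta) = eta * eps + eta * mean_s log( mean_a exp(Q(s, a) / eta) )
        eta = max(eta, 1e-6)
        scaled = q_values / eta
        scaled_max = scaled.max(axis=1, keepdims=True)  # for numerical stability
        log_mean_exp = scaled_max[:, 0] + np.log(np.exp(scaled - scaled_max).mean(axis=1))
        return eta * eps + eta * log_mean_exp.mean()

    # One-dimensional convex minimisation over the temperature eta > 0.
    eta = minimize(lambda x: dual(x[0]), x0=[1.0], bounds=[(1e-6, None)]).x[0]
    weights = np.exp((q_values - q_values.max(axis=1, keepdims=True)) / eta)
    return weights / weights.sum(axis=1, keepdims=True)

# Example: Q-values for a batch of 4 states with 8 sampled actions each.
w = estep_weights(np.random.randn(4, 8))
# The M-step would then fit the parametric policy by weighted maximum likelihood
# on these samples, subject to additional KL trust-region constraints.
```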
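For the Open Datasets row: the Control Suite tasks were later released as the `dm_control` package, so a minimal interaction loop might look as follows. The specific task choice is illustrative, and this reflects how the suite can be accessed today rather than code from the paper.

```python
import numpy as np
from dm_control import suite

# Load one continuous control task from the DeepMind Control Suite.
env = suite.load(domain_name="walker", task_name="walk")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Uniform random actions, just to show the interaction loop for control tasks.
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    # time_step.observation is a dict of named arrays; time_step.reward is a scalar.
```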
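For the Experiment Setup row, a configuration skeleton like the one below could hold the fixed settings the paper reports in its appendix. The field names are assumptions about the kinds of quantities Tables 2 and 3 cover; the values are deliberately left unset rather than guessed, so the appendix tables remain the source of truth.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MPOConfig:
    # Network architecture (feed-forward, per the quoted setup).
    policy_layer_sizes: Tuple[int, ...] = ()      # see the paper's appendix
    q_layer_sizes: Tuple[int, ...] = ()           # see the paper's appendix
    # E-step KL bound and M-step trust-region bounds (illustrative field names).
    epsilon: Optional[float] = None               # see Table 2 / Table 3
    epsilon_mean: Optional[float] = None          # see Table 2 / Table 3
    epsilon_covariance: Optional[float] = None    # see Table 2 / Table 3
    # Optimisation settings kept fixed across tasks.
    learning_rate: Optional[float] = None
    batch_size: Optional[int] = None
```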