Projections for Approximate Policy Iteration Algorithms

Authors: Riad Akrour, Joni Pajarinen, Jan Peters, Gerhard Neumann

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our first set of experiments is on simple optimization problems to assess the validity of our proposed optimization scheme for constrained problems. Most of the introduced projections g are not on the constraint boundary, with the exception of the entropy constraint of a Gaussian distribution. Thus, it remains to be seen if optimizing L ∘ g by gradient ascent can match the quality of solutions obtained via the method of Lagrange multipliers on simple problems. (A toy sketch of this composed-objective gradient ascent appears below the table.)
Researcher Affiliation | Collaboration | 1 IAS, TU Darmstadt, Darmstadt, Germany; 2 Tampere University, Finland; 3 L-CAS, University of Lincoln, Lincoln, United Kingdom; 4 Bosch Center for Artificial Intelligence (BCAI), Germany; 5 Max Planck Institute for Intelligent Systems, Tübingen, Germany.
Pseudocode | Yes | Algorithm 1 (DPS Gaussian policy projection); Algorithm 2 (API linear-Gaussian policy projection).
Open Source Code | Yes | Implementation of Alg. 2 is provided in https://github.com/akrouriad/papi.
Open Datasets | Yes | We run a first set of experiments on four benchmark tasks from Roboschool (Brockman et al., 2016).
Dataset Splits | No | The paper describes evaluation metrics and performance tracking during training (e.g., 'initial 100 iterations', 'best window of 500 trajectories'), but it does not specify explicit training/validation/test *dataset* splits, which are typically found with static datasets rather than dynamic RL environments.
Hardware Specification | Yes | Computations were conducted on the Lichtenberg high performance computer of TU Darmstadt and the NVIDIA DGX station.
Software Dependencies | No | The paper mentions implementing projections 'within OpenAI's code base (Dhariwal et al., 2017)' but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | All our experiments use a neural network policy with two hidden layers of 64 neurons. ... For all of the experiments, including Fig. 4, PAPI-PPO refers to performing 20 epochs with mini-batches of size 64. For the entropy constraint, we adopt a two-phase approach where we initially do not constrain the entropy until it reaches half of the initial entropy and then decrease β linearly by a fixed amount ϵ. Using the same parameters for PAPI-TRPO would result in improvements over TRPO for some tasks, but the entropy of the final policy was always relatively high. We obtained the best performance for PAPI-TRPO by enforcing an entropy equality constraint using Prop. 1 and only optimizing A for 10 epochs with mini-batches of size 64. (A configuration sketch of these settings also appears below the table.)
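
As a companion to the Research Type row above: the quoted passage asks whether gradient ascent on the composed objective L ∘ g can match Lagrange-multiplier solutions on simple constrained problems. Below is a minimal toy sketch of that idea, not the authors' code: a diagonal Gaussian whose parameters are passed through an entropy projection (the one case the row notes lands exactly on the constraint boundary), with the composed objective maximized by plain gradient ascent using finite-difference gradients. The objective, dimensionality, entropy bound, and step size are illustrative assumptions.

```python
# Toy sketch: gradient ascent on f(g(theta)), where g projects a diagonal
# Gaussian onto an entropy lower bound. Not the authors' implementation.
import numpy as np

D = 2                            # dimensionality of the Gaussian (assumption)
BETA = 1.0                       # entropy lower bound (assumption)
TARGET = np.array([1.0, -0.5])   # optimum of the toy objective (assumption)

def entropy(log_sigma):
    """Differential entropy of N(mu, diag(sigma^2))."""
    return np.sum(log_sigma) + 0.5 * D * np.log(2.0 * np.pi * np.e)

def project(theta):
    """Entropy projection g: uniformly rescale sigma so H(sigma) == BETA
    whenever the bound is violated; this lands exactly on the boundary."""
    mu, log_sigma = theta[:D], theta[D:]
    h = entropy(log_sigma)
    if h < BETA:
        log_sigma = log_sigma + (BETA - h) / D
    return np.concatenate([mu, log_sigma])

def objective(theta):
    """E_{x~N(mu, diag(sigma^2))}[-||x - TARGET||^2], in closed form.
    It pushes sigma toward 0, so the entropy constraint becomes active."""
    mu, log_sigma = theta[:D], theta[D:]
    return -np.sum((mu - TARGET) ** 2) - np.sum(np.exp(2.0 * log_sigma))

def num_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient of f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return g

theta = np.zeros(2 * D)                        # mu = 0, sigma = 1
composed = lambda th: objective(project(th))   # f(g(theta))
for _ in range(2000):
    theta = theta + 1e-2 * num_grad(composed, theta)

theta = project(theta)
print("mean:", theta[:D], "entropy:", entropy(theta[D:]), "(bound:", BETA, ")")
```

In this toy problem the mean converges to TARGET while the entropy settles on the bound, which is the kind of behaviour a Lagrange-multiplier solution of the same problem would produce.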
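For the Experiment Setup row, the quoted hyperparameters (two hidden layers of 64 units, 20 epochs with mini-batches of 64 for PAPI-PPO, and a two-phase entropy schedule that leaves entropy unconstrained until it falls to half of its initial value and then lowers the bound β by a fixed ϵ per iteration) can be collected into a small configuration sketch. This is a hedged illustration, not the authors' implementation (see the linked repository); the names PAPIConfig, make_policy and entropy_bound, the tanh activations, and the value of ϵ are assumptions.

```python
# Hedged configuration sketch of the quoted experiment setup.
from dataclasses import dataclass

import torch.nn as nn

@dataclass
class PAPIConfig:
    hidden_sizes: tuple = (64, 64)  # two hidden layers of 64 units (quoted)
    epochs: int = 20                # PAPI-PPO: 20 epochs per iteration (quoted)
    minibatch_size: int = 64        # mini-batches of size 64 (quoted)
    epsilon: float = 0.01           # per-iteration entropy decrement; value assumed

def make_policy(obs_dim: int, act_dim: int, cfg: PAPIConfig) -> nn.Module:
    """Mean network of a Gaussian policy with two 64-unit hidden layers."""
    h1, h2 = cfg.hidden_sizes
    return nn.Sequential(
        nn.Linear(obs_dim, h1), nn.Tanh(),   # tanh activations assumed
        nn.Linear(h1, h2), nn.Tanh(),
        nn.Linear(h2, act_dim),
    )

def entropy_bound(current_entropy, initial_entropy, prev_bound, cfg: PAPIConfig):
    """Two-phase schedule from the row: leave entropy unconstrained until it
    reaches half of its initial value, then lower the bound beta linearly by
    a fixed epsilon each iteration."""
    if prev_bound is None and current_entropy > 0.5 * initial_entropy:
        return None                          # phase 1: no entropy constraint
    start = prev_bound if prev_bound is not None else 0.5 * initial_entropy
    return start - cfg.epsilon               # phase 2: linear decrease
```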