Projections for Approximate Policy Iteration Algorithms
Authors: Riad Akrour, Joni Pajarinen, Jan Peters, Gerhard Neumann
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our first set of experiments is on simple optimization problems to assess the validity of our proposed optimization scheme for constrained problems. Most of the introduced projections g are not on the constraint boundary, with the exception of the entropy constraint of a Gaussian distribution. Thus, it remains to be seen whether optimizing L_g by gradient ascent can match the quality of solutions obtained via the method of Lagrange multipliers on simple problems. |
| Researcher Affiliation | Collaboration | (1) IAS, TU Darmstadt, Darmstadt, Germany; (2) Tampere University, Finland; (3) L-CAS, University of Lincoln, Lincoln, United Kingdom; (4) Bosch Center for Artificial Intelligence (BCAI), Germany; (5) Max Planck Institute for Intelligent Systems, Tübingen, Germany. |
| Pseudocode | Yes | Algorithm 1 DPS Gaussian policy projection; Algorithm 2 API linear-Gaussian policy projection |
| Open Source Code | Yes | Implementation of Alg. 2 is provided in https://github.com/akrouriad/papi. |
| Open Datasets | Yes | We run a first set of experiments on four benchmark tasks from Roboschool (Brockman et al., 2016). |
| Dataset Splits | No | The paper describes evaluation metrics and performance tracking during training (e.g., 'initial 100 iterations', 'best window of 500 trajectories'), but it does not specify explicit training/validation/test *dataset* splits, which are typically found with static datasets rather than dynamic RL environments. |
| Hardware Specification | Yes | Computations were conducted on the Lichtenberg high performance computer of TU Darmstadt and the NVIDIA DGX station. |
| Software Dependencies | No | The paper mentions implementing projections 'within OpenAI's code base (Dhariwal et al., 2017)' but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | All our experiments use a neural network policy with two hidden layers of 64 neurons. ... For all of the experiments including Fig. 4, PAPI-PPO refers to performing 20 epochs with mini-batches of size 64. For the entropy constraint, we adopt a two-phase approach where we initially do not constrain the entropy until it reaches half of the initial entropy and then decrease β linearly by a fixed amount of ϵ. Using the same parameters for PAPI-TRPO would result in improvements over TRPO for some tasks, but the entropy of the final policy was always relatively high. We obtained best performance for PAPI-TRPO by enforcing an entropy equality constraint using Prop. 1 and only optimizing A for 10 epochs with mini-batches of size 64. (A hedged sketch of this reported setup follows the table.) |
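The experiment-setup row above lends itself to a small illustrative sketch. The Python snippet below is a minimal, hypothetical reconstruction of the reported configuration only (two hidden layers of 64 neurons, 20 PPO epochs with mini-batches of size 64, and the two-phase entropy schedule); it is not the authors' implementation, which builds on OpenAI's code base and is released at https://github.com/akrouriad/papi. The helper names `make_policy` and `entropy_bound`, the use of PyTorch, and the per-iteration interpretation of the linear β decrement are assumptions.

```python
# Hypothetical sketch of the reported experiment setup; not the authors' code.
import torch.nn as nn


def make_policy(obs_dim: int, act_dim: int) -> nn.Module:
    """Policy network with two hidden layers of 64 neurons, as reported."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, act_dim),  # mean of a Gaussian policy head
    )


# PAPI-PPO optimization settings quoted in the table above.
PAPI_PPO_EPOCHS = 20
MINIBATCH_SIZE = 64


def entropy_bound(initial_entropy: float, current_bound: float,
                  current_entropy: float, eps: float) -> float:
    """Two-phase entropy schedule (assumed per-iteration update): leave the
    entropy unconstrained until it drops to half of its initial value, then
    decrease the bound beta linearly by a fixed amount eps."""
    if current_entropy > 0.5 * initial_entropy:
        return float("-inf")  # phase 1: entropy not yet constrained
    return current_bound - eps  # phase 2: linear decrease of beta
```

As a usage note, such a schedule would be queried once per policy-update iteration to set the entropy constraint passed to the projection, while the 20-epoch/64-minibatch settings would configure the inner PPO-style optimization loop.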