Implementation Matters in Deep RL: A Case Study on PPO and TRPO

Authors: Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, Aleksander Madry

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1 shows a histogram of the final rewards of agents trained with every possible configuration of the above optimizations; for each configuration, a grid search for the optimal learning rate is performed, and we measure the reward of random agents trained using the identified learning rate. Our findings suggest that many code-level optimizations are necessary for PPO to attain its claimed performance. As our results show in Table 2, it turns out that code-level optimizations contribute to algorithms' increased performance, often significantly more than the choice of algorithm (i.e., using PPO vs. TRPO). (A sketch of this ablation protocol appears after the table.)
Researcher Affiliation | Collaboration | Logan Engstrom 1, Andrew Ilyas 1, Shibani Santurkar 1, Dimitris Tsipras 1, Firdaus Janoos 2, Larry Rudolph 1,2, and Aleksander Madry 1 (1 MIT, 2 Two Sigma)
Pseudocode | Yes | Algorithm 1: PPO scaling optimization. (A minimal sketch of this reward-scaling step appears after the table.)
Open Source Code | Yes | Code for all the results shown in this work is available at https://github.com/MadryLab/implementation-matters.
Open Datasets | Yes | Figure 1: An ablation study on the first four optimizations described in Section 3 (value clipping, reward scaling, network initialization, and learning rate annealing). For each of the 2^4 possible configurations of optimizations, we train a Humanoid-v2 (top) and Walker2d-v2 (bottom) agent using PPO with five random seeds and a grid of learning rates... (Table listing training steps per MuJoCo task: Walker2d-v2, Hopper-v2, Humanoid-v2.) A sketch of these four optimizations appears after the table.
Dataset Splits | No | The paper describes hyperparameter grid searches and uses held-out data for evaluation, but does not explicitly provide details about a dedicated validation dataset split with specific percentages or counts.
Hardware Specification | No | The paper mentions 'restrictions on computational resources' but does not provide specific details such as exact GPU/CPU models, processor types, or memory.
Software Dependencies | No | The paper mentions software such as Adam, PyTorch, TensorFlow, and OpenAI Baselines, but does not specify their version numbers or other key software dependencies with reproducible versioning.
Experiment Setup | Yes | All the hyperparameters used in this paper were obtained through grid searches. For PPO, the exact code-level optimizations and their associated hyperparameters (e.g., coefficients for entropy regularization, reward clipping, etc.) were taken from the OpenAI Baselines repository, and gridding is performed over the value function learning rate, the clipping constant, and the learning rate schedule. ...The final parameters for each algorithm are given below: Table 4: Hyperparameters for all algorithms for Walker2d-v2. Table 5: Hyperparameters for all algorithms for Humanoid-v2. Table 6: Hyperparameters for all algorithms for Hopper-v2.
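
The "Pseudocode" row cites Algorithm 1, the reward-scaling optimization. Below is a minimal Python sketch of that kind of scaling, in the style of OpenAI Baselines' VecNormalize (each reward is divided by the running standard deviation of a rolling discounted return, with no mean subtraction). The class name, defaults, and Welford update are illustrative choices, not the authors' code.

    import numpy as np

    class RewardScaler:
        """Running reward scaling in the style of OpenAI Baselines' VecNormalize:
        each reward is divided by the standard deviation of a rolling discounted
        return (no mean subtraction). Defaults here are illustrative."""

        def __init__(self, gamma=0.99, eps=1e-8):
            self.gamma = gamma   # discount used for the rolling return
            self.eps = eps       # numerical floor added to the std
            self.ret = 0.0       # rolling discounted return
            self.count = 0       # number of returns seen so far
            self.mean = 0.0      # running mean of the rolling return (Welford)
            self.m2 = 0.0        # running sum of squared deviations (Welford)

        def _update(self, x):
            # Welford's online update of the return statistics.
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

        def scale(self, reward, done=False):
            # Fold the new reward into the rolling return, then divide the raw
            # reward by the running std of that return.
            self.ret = self.gamma * self.ret + reward
            self._update(self.ret)
            std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
            if done:
                self.ret = 0.0   # reset the rolling return at episode boundaries
            return reward / std

    # Usage inside a rollout loop:
    # scaler = RewardScaler()
    # scaled_r = scaler.scale(r, done)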
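The other three optimizations ablated in Figure 1 (value clipping, orthogonal network initialization, and learning-rate annealing) can be sketched in PyTorch as follows. The network sizes, learning rate, and update count are placeholders; only the general shape of each optimization follows the paper's description.

    import torch
    import torch.nn as nn

    def orthogonal_init(module, gain=2 ** 0.5):
        # Orthogonal weight initialization with the sqrt(2) gain commonly used
        # for hidden layers in Baselines-style PPO; biases start at zero.
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight, gain=gain)
            nn.init.zeros_(module.bias)

    def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
        # Value clipping: the new value prediction is kept within clip_eps of the
        # value predicted at data-collection time, and the larger of the clipped
        # and unclipped squared errors is used.
        clipped = old_values + (values - old_values).clamp(-clip_eps, clip_eps)
        return torch.max((values - returns) ** 2, (clipped - returns) ** 2).mean()

    # Learning-rate annealing: decay Adam's step size linearly to zero over training.
    policy = nn.Sequential(nn.Linear(17, 64), nn.Tanh(), nn.Linear(64, 6))  # Walker2d-v2 sizes, illustrative
    policy.apply(orthogonal_init)
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    total_updates = 488  # hypothetical number of PPO updates
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(0.0, 1.0 - step / total_updates))
    # Call scheduler.step() once per PPO update to apply the annealing.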
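Finally, the ablation protocol quoted in the "Research Type" and "Experiment Setup" rows (every configuration of the four optimizations, a learning-rate grid, five random seeds) roughly corresponds to a loop like the one below. train_ppo, the learning-rate grid, and the returned reward values are hypothetical stand-ins so the sketch is runnable; they are not the released code or the paper's exact grid.

    import itertools
    import random

    def train_ppo(env_name, value_clip, reward_scale, ortho_init, lr_anneal, lr, seed):
        """Hypothetical stand-in for a full PPO training run: returns a fake
        'final reward' so the ablation loop below executes as a sketch."""
        random.seed(hash((env_name, value_clip, reward_scale, ortho_init, lr_anneal, lr, seed)))
        return random.uniform(0.0, 4000.0)

    LEARNING_RATES = [1e-4, 3e-4, 1e-3]  # illustrative grid, not the paper's exact values
    SEEDS = range(5)

    results = {}
    for config in itertools.product([False, True], repeat=4):  # 2^4 = 16 configurations
        # For each configuration, grid-search the learning rate by mean final reward...
        def mean_reward(lr):
            return sum(train_ppo("Walker2d-v2", *config, lr=lr, seed=s) for s in SEEDS) / len(SEEDS)
        best_lr = max(LEARNING_RATES, key=mean_reward)
        # ...then record the per-seed final rewards at that learning rate
        # (these are the values histogrammed in Figure 1).
        results[config] = [train_ppo("Walker2d-v2", *config, lr=best_lr, seed=s) for s in SEEDS]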