Discovering symbolic policies with deep reinforcement learning
Authors: Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk, Jacob F Pettit, Daniel Faissol
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DSP on eight benchmark control tasks: five single-action environments (Cart Pole, Mountain Car, Pendulum, Inverted Double Pendulum, and Inverted Pendulum Swingup) and three multi-action environments (Lunar Lander, Hopper, and Bipedal Walker). The best symbolic policies found for each environment are reported in Table 1. We include results both before and after constant optimization (labelled DSP and DSPo, respectively). In Table 2, we report the evaluation average episodic rewards for DSP, DSPo (i.e., DSP with optimization of constants), the Regression baseline, and the seven DRL baselines. |
| Researcher Affiliation | Academia | Lawrence Livermore National Laboratory, Livermore, California, USA. |
| Pseudocode | Yes | Pseudocode for DSP with the anchoring algorithm is provided in the Appendix. |
| Open Source Code | Yes | Source code is made available at https://www.github.com/brendenpetersen/deep-symbolic-optimization. |
| Open Datasets | Yes | Specifically, we use CartPoleContinuous-v0 from https://gist.github.com/iandanforth; MountainCarContinuous-v0, Pendulum-v0, LunarLanderContinuous-v2, and BipedalWalker-v2 from OpenAI Gym (Brockman et al., 2016); and InvertedDoublePendulumBulletEnv-v0, InvertedPendulumSwingupBulletEnv-v0, and HopperBulletEnv-v0 from PyBullet (Coumans & Bai, 2016). |
| Dataset Splits | No | No specific details about training, validation, or test dataset splits (e.g., percentages or counts of data samples) were provided, as the training is done interactively in an RL environment, and evaluation uses varying environment seeds. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided. |
| Software Dependencies | No | The paper mentions software like OpenAI Gym and PyBullet but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | For each action in the control task, we perform 3 independent training runs of DSP with different random seeds, selecting the best symbolic policy at the end of training. All tasks use the library L = {+, −, ×, ÷, sin, cos, exp, log, 0.1, 1.0, 5.0, s1, ..., sn}. We constrain the length of each expression to fall between 4 and 30 tokens, inclusive. The Policy Generator is an RNN comprising a single-layer LSTM with 32 hidden units. (Illustrative environment-evaluation and policy-generator sketches follow this table.) |
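To ground the "Open Datasets" and "Dataset Splits" rows, the sketch below shows one way the listed environments could be instantiated and a fixed symbolic policy scored by its average episodic reward over varying evaluation seeds. It is a minimal sketch, not the authors' evaluation code: it assumes the pre-0.26 Gym API (where `env.seed()` and the four-tuple `step()` return are available) and the `pybullet_envs` registration module, and the example policy is a hypothetical placeholder rather than one of the paper's discovered expressions.

```python
import numpy as np
import gym
import pybullet_envs  # noqa: F401 -- importing registers the Bullet environments with Gym

# Environment IDs from the "Open Datasets" row. CartPoleContinuous-v0 comes from
# the linked gist and must be registered separately before gym.make can find it.
ENV_IDS = [
    "MountainCarContinuous-v0",
    "Pendulum-v0",
    "LunarLanderContinuous-v2",
    "BipedalWalker-v2",
    "InvertedDoublePendulumBulletEnv-v0",
    "InvertedPendulumSwingupBulletEnv-v0",
    "HopperBulletEnv-v0",
]

def evaluate(env_id, policy_fn, n_episodes=10, seed0=0):
    """Average episodic reward of a deterministic policy over varying environment seeds."""
    env = gym.make(env_id)
    returns = []
    for i in range(n_episodes):
        env.seed(seed0 + i)                      # pre-0.26 Gym seeding API
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # Clip to the action bounds so an arbitrary expression stays valid.
            action = np.clip(policy_fn(obs), env.action_space.low, env.action_space.high)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    env.close()
    return float(np.mean(returns))

# Hypothetical single-action symbolic policy a(s) = sin(s1) - 5.0 * s3 (not from the paper).
example_policy = lambda s: np.array([np.sin(s[0]) - 5.0 * s[2]])
print(evaluate("Pendulum-v0", example_policy))
```

Because training happens interactively inside the RL environment, a per-seed evaluation loop of this kind is the closest analogue of a held-out test split.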
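The "Experiment Setup" row fixes the token library, a 4-30 token length window, and a Policy Generator consisting of a single-layer LSTM with 32 hidden units. The sketch below is a minimal autoregressive token sampler with that architecture, written in PyTorch as an assumption (the released repository has its own implementation); the number of state tokens (s1..s4), the start-token convention, and the arity-based stopping rule are illustrative choices, and the paper's constraint masking (including the minimum-length rule) is omitted.

```python
import torch
import torch.nn as nn

# Token library from the "Experiment Setup" row; the number of state tokens
# s1..sn depends on the environment, so four are used here as a placeholder.
LIBRARY = ["+", "-", "*", "/", "sin", "cos", "exp", "log",
           "0.1", "1.0", "5.0"] + [f"s{i}" for i in range(1, 5)]
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "sin": 1, "cos": 1, "exp": 1, "log": 1}
MIN_LEN, MAX_LEN, HIDDEN = 4, 30, 32   # length window and LSTM width from the quoted setup

class PolicyGenerator(nn.Module):
    """Single-layer LSTM (32 hidden units) emitting a categorical distribution over tokens."""
    def __init__(self, n_tokens, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(n_tokens + 1, hidden)   # extra index serves as a start token
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, n_tokens)

    def step(self, token_idx, state):
        out, state = self.lstm(self.embed(token_idx), state)
        return self.head(out[:, -1]), state

def sample_expression(gen):
    """Sample one expression as a pre-order traversal, stopping when the tree is complete."""
    tokens, logps = [], []
    inp, state, need = torch.tensor([[len(LIBRARY)]]), None, 1   # one subexpression required
    while need > 0 and len(tokens) < MAX_LEN:
        logits, state = gen.step(inp, state)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        tokens.append(tok.item())
        logps.append(dist.log_prob(tok))
        need += ARITY.get(LIBRARY[tok.item()], 0) - 1   # operators open new operand slots
        inp = tok.unsqueeze(0)
    # The paper additionally masks tokens so expression lengths stay in [MIN_LEN, MAX_LEN];
    # only the upper cap is enforced in this sketch.
    return [LIBRARY[t] for t in tokens], torch.stack(logps).sum()

gen = PolicyGenerator(len(LIBRARY))
tokens, log_prob = sample_expression(gen)
print(tokens, float(log_prob))
```

In the full method, each sampled token sequence is instantiated as a symbolic expression, scored by its episodic reward in the control environment, and the generator's parameters are updated with a policy-gradient objective on those rewards.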