Discovering symbolic policies with deep reinforcement learning

Authors: Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk, Jacob F Pettit, Daniel Faissol

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DSP on eight benchmark control tasks: five single-action environments (CartPole, MountainCar, Pendulum, InvertedDoublePendulum, and InvertedPendulumSwingup) and three multi-action environments (LunarLander, Hopper, and BipedalWalker). The best symbolic policies found for each environment are reported in Table 1. We include results both before and after constant optimization (labelled DSP and DSPo, respectively). In Table 2, we report the average episodic reward at evaluation for DSP, DSPo (i.e., DSP with optimization of constants), the Regression baseline, and the seven DRL baselines.
Researcher Affiliation | Academia | Lawrence Livermore National Laboratory, Livermore, California, USA.
Pseudocode | Yes | Pseudocode for DSP with the anchoring algorithm is provided in the Appendix.
Open Source Code | Yes | Source code is made available at https://www.github.com/brendenpetersen/deep-symbolic-optimization.
Open Datasets | Yes | Specifically, we use CartPoleContinuous-v0 from https://gist.github.com/iandanforth; MountainCarContinuous-v0, Pendulum-v0, LunarLanderContinuous-v2, and BipedalWalker-v2 from OpenAI Gym (Brockman et al., 2016); and InvertedDoublePendulumBulletEnv-v0, InvertedPendulumSwingupBulletEnv-v0, and HopperBulletEnv-v0 from PyBullet (Coumans & Bai, 2016). (An environment-setup sketch is shown after the table.)
Dataset Splits | No | No specific details about training, validation, or test dataset splits (e.g., percentages or counts of data samples) were provided, as training is done interactively in an RL environment and evaluation uses varying environment seeds.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided.
Software Dependencies | No | The paper mentions software such as OpenAI Gym and PyBullet but does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | For each action in the control task, we perform 3 independent training runs of DSP with different random seeds, selecting the best symbolic policy at the end of training. All tasks use the library L = {+, −, ×, ÷, sin, cos, exp, log, 0.1, 1.0, 5.0, s1, ..., sn}. We constrain the length of each expression to fall between 4 and 30 tokens, inclusive. The Policy Generator is an RNN comprising a single-layer LSTM with 32 hidden units. (A minimal policy-generator sketch is shown after the table.)
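
The following is a minimal, illustrative sketch of how the environments listed under "Open Datasets" could be instantiated for inspection. It assumes the legacy `gym` API implied by the -v0/-v2 IDs and the `pybullet_envs` package (plus the Box2D extra for the Gym environments); the paper does not pin versions, and CartPoleContinuous-v0 is omitted because it comes from the linked gist rather than a standard Gym registration. The loop and variable names are hypothetical and not taken from the authors' code.

```python
# Illustrative only: instantiate the benchmark environments named above.
# Importing pybullet_envs registers the *BulletEnv-v0 IDs with Gym.
import gym
import pybullet_envs  # noqa: F401

ENV_IDS = [
    "MountainCarContinuous-v0",
    "Pendulum-v0",
    "LunarLanderContinuous-v2",
    "BipedalWalker-v2",
    "InvertedDoublePendulumBulletEnv-v0",
    "InvertedPendulumSwingupBulletEnv-v0",
    "HopperBulletEnv-v0",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()  # legacy Gym API: reset() returns only the observation
    print(f"{env_id}: obs dim {env.observation_space.shape}, "
          f"action dim {env.action_space.shape}")
    env.close()
```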
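As a companion to the "Experiment Setup" row, here is a hedged sketch of the token library and a single-layer, 32-unit LSTM Policy Generator that autoregressively emits a distribution over tokens. This is not the authors' implementation (the released deep-symbolic-optimization repository is the authoritative version); PyTorch, the class and variable names, and the observation dimension are all assumptions for illustration, and the 4-to-30-token length bound is shown only as constants.

```python
import torch
import torch.nn as nn

N_STATE_VARS = 4  # n = observation dimension; environment-dependent (assumption)

# Token library L = {+, -, *, /, sin, cos, exp, log, 0.1, 1.0, 5.0, s_1, ..., s_n}
LIBRARY = (
    ["add", "sub", "mul", "div", "sin", "cos", "exp", "log", "0.1", "1.0", "5.0"]
    + [f"s{i}" for i in range(1, N_STATE_VARS + 1)]
)
MIN_LENGTH, MAX_LENGTH = 4, 30  # inclusive bounds on expression length


class PolicyGenerator(nn.Module):
    """Single-layer LSTM (32 hidden units) emitting logits over the token library."""

    def __init__(self, n_tokens: int, hidden_size: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_size, n_tokens)

    def forward(self, token_ids, state=None):
        x = self.embed(token_ids)          # (batch, seq) -> (batch, seq, hidden)
        out, state = self.lstm(x, state)   # carry hidden state for autoregressive sampling
        return self.head(out), state       # logits over the token library


generator = PolicyGenerator(n_tokens=len(LIBRARY))
start_token = torch.zeros(1, 1, dtype=torch.long)  # arbitrary "start" input
logits, _ = generator(start_token)
print(logits.shape)  # torch.Size([1, 1, 15]) with the settings above
```

Sampling one token at a time from these logits, subject to the length constraints above, would yield candidate expression skeletons of the kind the paper searches over.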