Discovering symbolic policies with deep reinforcement learning

Authors: Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk, Jacob F Pettit, Daniel Faissol

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DSP on eight benchmark control tasks: five single-action environments (CartPole, MountainCar, Pendulum, InvertedDoublePendulum, and InvertedPendulumSwingup) and three multi-action environments (LunarLander, Hopper, and BipedalWalker). The best symbolic policies found for each environment are reported in Table 1. We include results both before and after constant optimization (labelled DSP and DSPo, respectively). In Table 2, we report the average episodic reward at evaluation for DSP, DSPo (i.e., DSP with optimization of constants), the Regression baseline, and the seven DRL baselines.
Researcher Affiliation | Academia | Lawrence Livermore National Laboratory, Livermore, California, USA.
Pseudocode | Yes | Pseudocode for DSP with the anchoring algorithm is provided in the Appendix.
Open Source Code | Yes | Source code is made available at https://www.github.com/brendenpetersen/deep-symbolic-optimization.
Open Datasets | Yes | Specifically, we use CartPoleContinuous-v0 from https://gist.github.com/iandanforth; MountainCarContinuous-v0, Pendulum-v0, LunarLanderContinuous-v2, and BipedalWalker-v2 from OpenAI Gym (Brockman et al., 2016); and InvertedDoublePendulumBulletEnv-v0, InvertedPendulumSwingupBulletEnv-v0, and HopperBulletEnv-v0 from PyBullet (Coumans & Bai, 2016). (An environment-setup sketch is shown after the table.)
Dataset Splits | No | No specific details about training, validation, or test dataset splits (e.g., percentages or counts of data samples) were provided, as training is done interactively in an RL environment and evaluation uses varying environment seeds.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided.
Software Dependencies | No | The paper mentions software such as OpenAI Gym and PyBullet but does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | For each action in the control task, we perform 3 independent training runs of DSP with different random seeds, selecting the best symbolic policy at the end of training. All tasks use the library L = {+, −, ×, ÷, sin, cos, exp, log, 0.1, 1.0, 5.0, s1, ..., sn}. We constrain the length of each expression to fall between 4 and 30 tokens, inclusive. The Policy Generator is an RNN comprising a single-layer LSTM with 32 hidden units. (A minimal policy-generator sketch is shown after the table.)
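
The following is a minimal, illustrative sketch of how the environments listed under "Open Datasets" could be instantiated for inspection. It assumes the legacy `gym` API implied by the -v0/-v2 IDs and the `pybullet_envs` package (plus the Box2D extra for the Gym environments); the paper does not pin versions, and CartPoleContinuous-v0 is omitted because it comes from the linked gist rather than a standard Gym registration. The loop and variable names are hypothetical and not taken from the authors' code.

```python
# Illustrative only: instantiate the benchmark environments named above.
# Importing pybullet_envs registers the *BulletEnv-v0 IDs with Gym.
import gym
import pybullet_envs  # noqa: F401

ENV_IDS = [
    "MountainCarContinuous-v0",
    "Pendulum-v0",
    "LunarLanderContinuous-v2",
    "BipedalWalker-v2",
    "InvertedDoublePendulumBulletEnv-v0",
    "InvertedPendulumSwingupBulletEnv-v0",
    "HopperBulletEnv-v0",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()  # legacy Gym API: reset() returns only the observation
    print(f"{env_id}: obs dim {env.observation_space.shape}, "
          f"action dim {env.action_space.shape}")
    env.close()
```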
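As a companion to the "Experiment Setup" row, here is a hedged sketch of the token library and a single-layer, 32-unit LSTM Policy Generator that autoregressively emits a distribution over tokens. This is not the authors' implementation (the released deep-symbolic-optimization repository is the authoritative version); PyTorch, the class and variable names, and the observation dimension are all assumptions for illustration, and the 4-to-30-token length bound is shown only as constants.

```python
import torch
import torch.nn as nn

N_STATE_VARS = 4  # n = observation dimension; environment-dependent (assumption)

# Token library L = {+, -, *, /, sin, cos, exp, log, 0.1, 1.0, 5.0, s_1, ..., s_n}
LIBRARY = (
    ["add", "sub", "mul", "div", "sin", "cos", "exp", "log", "0.1", "1.0", "5.0"]
    + [f"s{i}" for i in range(1, N_STATE_VARS + 1)]
)
MIN_LENGTH, MAX_LENGTH = 4, 30  # inclusive bounds on expression length


class PolicyGenerator(nn.Module):
    """Single-layer LSTM (32 hidden units) emitting logits over the token library."""

    def __init__(self, n_tokens: int, hidden_size: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_size, n_tokens)

    def forward(self, token_ids, state=None):
        x = self.embed(token_ids)          # (batch, seq) -> (batch, seq, hidden)
        out, state = self.lstm(x, state)   # carry hidden state for autoregressive sampling
        return self.head(out), state       # logits over the token library


generator = PolicyGenerator(n_tokens=len(LIBRARY))
start_token = torch.zeros(1, 1, dtype=torch.long)  # arbitrary "start" input
logits, _ = generator(start_token)
print(logits.shape)  # torch.Size([1, 1, 15]) with the settings above
```

Sampling one token at a time from these logits, subject to the length constraints above, would yield candidate expression skeletons of the kind the paper searches over.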