Efficient Symbolic Policy Learning with Differentiable Symbolic Expression

Authors: Jiaming Guo, Rui Zhang, Shaohui Peng, Qi Yi, Xing Hu, Ruizhi Chen, Zidong Du, Xishan Zhang, Ling Li, Qi Guo, Yunji Chen

NeurIPS 2023

Reproducibility assessment. Each entry below gives the variable, the result, and the supporting LLM response.
Research Type: Experimental
"Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that, compared with neural network policies, the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable."
Researcher Affiliation: Collaboration
(1) SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; (2) Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; (3) University of Science and Technology of China (USTC), Hefei, China; (4) Cambricon Technologies; (5) Shanghai Innovation Center for Processor Technologies (SHIC), Shanghai, China; (6) University of Chinese Academy of Sciences (UCAS), Beijing, China
Pseudocode: Yes
"Algorithm 1: The training process of ESPL. Input: the number of iterations for the temperature and L0-norm schedule t_s; the target temperature τ_t and the target minimum L0 norm l_t; the symbolic network SN and its parameters w, b; the probabilities p of the path selector. Algorithm 2: The training process of CSP."
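The schedule inputs named in Algorithm 1 (t_s, τ_t, l_t, and the selector probabilities p) suggest the following minimal sketch of the annealing logic. Everything here is an assumption made for illustration — the helper names (`anneal`, `expected_l0`), the tensor sizes, and all numeric values — not the paper's actual implementation.

```python
import torch

def anneal(start, target, step, total_steps):
    """Linear interpolation from `start` to `target` over `total_steps`."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (target - start)

def expected_l0(p):
    """Expected L0 norm of the path selector: sum of gate-open probabilities."""
    return p.sum()

# --- illustrative training-loop skeleton (assumed values throughout) ---
tau_0, tau_t = 1.0, 0.2        # initial and target temperature
l_t = 10.0                     # target minimum L0 norm
t_s = 10_000                   # schedule length from Algorithm 1

p_logits = torch.zeros(64, requires_grad=True)   # path-selector logits
opt = torch.optim.Adam([p_logits], lr=1e-3)

for step in range(t_s):
    tau = anneal(tau_0, tau_t, step, t_s)
    p = torch.sigmoid(p_logits / tau)            # relaxed, differentiable gates
    # The RL objective on the symbolic network SN would supply this term;
    # a zero placeholder keeps the sketch self-contained.
    policy_loss = torch.tensor(0.0)
    l0_budget = anneal(p_logits.numel(), l_t, step, t_s)
    sparsity_loss = torch.relu(expected_l0(p) - l0_budget)
    opt.zero_grad()
    (policy_loss + sparsity_loss).backward()
    opt.step()
```

The design point this illustrates: both the gate temperature and the L0-norm budget are annealed jointly over t_s iterations, so the network is gradually driven toward a sparse symbolic expression rather than pruned abruptly.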
Open Source Code: No
The paper mentions providing an example video and references a third-party codebase (rl-baselines-zoo), but it does not explicitly state that its own methodology's source code is open-sourced, nor does it provide a direct link.
Open Datasets: Yes
"For single-task RL, we evaluated our method on benchmark control tasks which are presented in DSP: (1) Cart Pole; (2) Mountain Car; (3) Pendulum; (4) Inverted Double Pendulum; (5) Inverted Pendulum Swingup; (6) Lunar Lander; (7) Hopper; (8) Bipedal Walker. For meta-RL, we evaluate the CSP on several continuous control environments which are modified from the environments of OpenAI Gym [40]."
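For reference, the benchmarks above map onto standard Gym registrations along the following lines. The environment IDs and versions here are assumptions (the section does not state them, and the swing-up task is typically a PyBullet registration); the loop skips anything not installed.

```python
import gym

# Hypothetical IDs for the listed benchmarks; the paper's exact versions
# are not given in this section.
SINGLE_TASK_ENVS = [
    "CartPole-v1",
    "MountainCarContinuous-v0",
    "Pendulum-v1",
    "InvertedDoublePendulum-v2",
    "InvertedPendulumSwingupBulletEnv-v0",  # needs `import pybullet_envs` to register
    "LunarLanderContinuous-v2",
    "Hopper-v2",
    "BipedalWalker-v3",
]

for env_id in SINGLE_TASK_ENVS:
    try:
        env = gym.make(env_id)
    except gym.error.Error as exc:
        print(f"{env_id}: not registered ({exc})")
        continue
    print(env_id, env.observation_space.shape, env.action_space)
    env.close()
```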
Dataset Splits: Yes
"For a training task, the target velocity is sampled uniformly from [0, 2.5]. For a test task, the target velocity is sampled uniformly from [2.5, 3.0]. The horizon length is set as 200. For the experiment, we sample 50 training tasks and 15 test tasks."
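The split is fully specified by those numbers, so it can be reproduced directly; a minimal sketch (the seed is an arbitrary choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for illustration

# Disjoint uniform ranges for train/test target velocities, horizon 200.
N_TRAIN, N_TEST, HORIZON = 50, 15, 200

train_velocities = rng.uniform(0.0, 2.5, size=N_TRAIN)
test_velocities = rng.uniform(2.5, 3.0, size=N_TEST)

print(f"{N_TRAIN} train tasks, e.g. v_target={train_velocities[0]:.2f}")
print(f"{N_TEST} test tasks, e.g. v_target={test_velocities[0]:.2f}")
```

Because the test range [2.5, 3.0] does not overlap the training range [0, 2.5], test tasks genuinely probe extrapolation to unseen target velocities.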
Hardware Specification: Yes
"We train the proposed CSP and ESPL with an Nvidia V100 GPU. When evaluating the inference time, we use an Intel(R) Xeon(R) Gold 5218R @ 2.10GHz CPU."
Software Dependencies: No
The paper states: "The implementation of our ESPL and CSP is based on the pytorch[51]." It cites PyTorch but does not specify a version number.
Experiment Setup: Yes
"In this section, we give the main hyperparameters of ESPL for single-task RL. We show the common hyperparameters of ESPL in Table 8. We also list the environment-specific hyperparameters in Table 9."
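The common/per-environment split described above maps naturally onto a layered config. The skeleton below only mirrors that structure; every value is a placeholder (the actual settings live in Tables 8 and 9 of the paper, which are not reproduced here).

```python
# Skeleton mirroring the split between common hyperparameters (Table 8)
# and per-environment overrides (Table 9). All values are placeholders.
COMMON_HPARAMS = {
    "lr": None,              # learning rate (see Table 8)
    "tau_target": None,      # target temperature τ_t
    "l0_target": None,       # target minimum L0 norm l_t
    "schedule_steps": None,  # schedule length t_s
}

ENV_HPARAMS = {
    "Hopper": {},            # per-environment overrides (see Table 9)
    "BipedalWalker": {},
}

def hparams_for(env_name):
    """Merge common hyperparameters with environment-specific overrides."""
    hp = dict(COMMON_HPARAMS)
    hp.update(ENV_HPARAMS.get(env_name, {}))
    return hp
```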