SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation
Authors: Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, Le Song
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our algorithm compares favorably to state-of-the-art baselines in several benchmark control problems. and We tested SBEED across multiple continuous control tasks from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo simulator (Todorov et al., 2012), including Pendulum-v0, InvertedDoublePendulum-v1, HalfCheetah-v1, Swimmer-v1, and Hopper-v1. |
| Researcher Affiliation | Collaboration | 1 Georgia Institute of Technology, 2 Google Inc., 3 Microsoft Research, 4 University of Illinois at Urbana-Champaign, 5 Tencent AI Lab. |
| Pseudocode | Yes | Algorithm 1: Online SBEED learning with experience replay (a hedged sketch of such a loop appears after this table). |
| Open Source Code | No | No explicit statement or link providing access to the open-source code for the described methodology. |
| Open Datasets | Yes | We tested SBEED across multiple continuous control tasks from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo simulator (Todorov et al., 2012), including Pendulum-v0, InvertedDoublePendulum-v1, HalfCheetah-v1, Swimmer-v1, and Hopper-v1 (see the environment-setup snippet after this table). |
| Dataset Splits | No | The paper uses continuous control tasks from the OpenAI Gym and the MuJoCo simulator, which involve agent interaction with an environment to generate data for training. It does not describe explicit train/validation/test dataset splits in terms of percentages or sample counts for a pre-collected static dataset. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments (e.g., specific GPU/CPU models or cloud instance types). |
| Software Dependencies | No | The paper mentions 'OpenAI Gym' and the 'MuJoCo simulator' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The goal of our experimental evaluation is twofold: (i) to better understand the effect of each algorithmic component in the proposed algorithm; (ii) to demonstrate the stability and efficiency of SBEED in both off-policy and on-policy settings. Therefore, we conducted an ablation study on SBEED, and a comprehensive comparison to state-of-the-art reinforcement learning algorithms. While we derive and present SBEED for the single-step Bellman error case, it can be extended to multi-step cases as shown in the long version. In our experiment, we used this multi-step version. and We varied λ and evaluated the performance of SBEED. and The effect of such cancellation is controlled by η ∈ [0, 1], and we expected an intermediate value to give the best performance. This is verified by the experiment of varying η, as shown in Figure 1(b). and We tested the performance of the algorithm with different lookahead lengths (denoted by k). |
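
For reference, the continuous-control tasks quoted in the Open Datasets row can be created through the OpenAI Gym API. The snippet below is a minimal sketch, assuming an older Gym release (e.g. the 0.9.x line) that still ships the v0/v1 environment IDs and a working mujoco-py install for the MuJoCo-backed tasks; it is not taken from the paper.

```python
import gym

# Continuous-control tasks named in the paper's experiments.
# Note: the v0/v1 IDs require an older Gym release (and MuJoCo/mujoco-py for
# the MuJoCo-backed tasks); newer Gym/Gymnasium versions renamed these environments.
ENV_IDS = [
    "Pendulum-v0",
    "InvertedDoublePendulum-v1",
    "HalfCheetah-v1",
    "Swimmer-v1",
    "Hopper-v1",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    print(env_id, env.observation_space.shape, env.action_space.shape)
    env.close()
```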
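The Pseudocode and Experiment Setup rows describe an online, replay-based primal-dual procedure built on an entropy-smoothed Bellman residual, with λ weighting the entropy term and η ∈ [0, 1] controlling a cancellation (bias/variance) term. The sketch below is a hedged reconstruction of such a loop under that reading, not the authors' implementation: the Pendulum-v0 choice, network sizes, learning rates, the λ/η values, and the exact form of the η-weighted cancellation term are assumptions, and the multi-step lookahead k used in the paper's experiments is omitted.

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

# Illustrative hyperparameters only -- not the paper's settings.
GAMMA = 0.99   # discount factor
LAM = 0.01     # entropy-regularization weight (lambda in the paper)
ETA = 0.5      # weight of the cancellation term (eta in the paper)
BATCH, BUFFER_CAP = 64, 10_000

env = gym.make("Pendulum-v0")          # assumes an older Gym release with the v0 ID
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 64), nn.Tanh(), nn.Linear(64, n_out))

value = mlp(obs_dim, 1)                # primal: V(s)
policy_mu = mlp(obs_dim, act_dim)      # primal: mean of Gaussian policy pi(a|s)
log_std = nn.Parameter(torch.zeros(act_dim))
dual = mlp(obs_dim + act_dim, 1)       # dual: nu(s, a), fit toward E[delta | s, a]

primal_opt = torch.optim.Adam(
    list(value.parameters()) + list(policy_mu.parameters()) + [log_std], lr=1e-3)
dual_opt = torch.optim.Adam(dual.parameters(), lr=1e-3)

def log_prob(s, a):
    dist = torch.distributions.Normal(policy_mu(s), log_std.exp())
    return dist.log_prob(a).sum(-1, keepdim=True)

buffer = deque(maxlen=BUFFER_CAP)
obs = env.reset()

for step in range(5_000):
    # Act with the current stochastic policy and store the transition in the replay buffer.
    with torch.no_grad():
        s = torch.as_tensor(obs, dtype=torch.float32)
        a = torch.distributions.Normal(policy_mu(s), log_std.exp()).sample().numpy()
    a = np.clip(a, env.action_space.low, env.action_space.high)
    next_obs, r, done, _ = env.step(a)
    buffer.append((obs, a, r, next_obs, float(done)))
    obs = env.reset() if done else next_obs

    if len(buffer) < BATCH:
        continue

    # Sample a minibatch of past transitions.
    s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                      for x in zip(*random.sample(buffer, BATCH)))
    r, d = r.unsqueeze(-1), d.unsqueeze(-1)

    # Entropy-smoothed one-step Bellman residual:
    #   delta = R + gamma * V(s') - lambda * log pi(a|s) - V(s)
    # (the paper's experiments use a multi-step variant, omitted here for brevity).
    delta = r + GAMMA * (1 - d) * value(s2) - LAM * log_prob(s, a) - value(s)
    nu = dual(torch.cat([s, a], dim=-1))

    # Dual step: fit nu(s, a) toward the sampled residual (inner maximization).
    dual_loss = ((delta.detach() - nu) ** 2).mean()
    dual_opt.zero_grad(); dual_loss.backward(); dual_opt.step()

    # Primal step on V and pi: squared residual minus an eta-weighted cancellation
    # term built from the (detached) dual estimate; this is an assumed surrogate.
    primal_loss = (delta ** 2).mean() - ETA * ((delta - nu.detach()) ** 2).mean()
    primal_opt.zero_grad(); primal_loss.backward(); primal_opt.step()
```

In this sketch, setting η near 0 reduces the primal step to a plain squared-residual update, while η near 1 subtracts the fitted dual term more aggressively; the paper's ablation over λ, η, and k would correspond to rerunning such a loop over a grid of those values.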