Efficient Exploration in Continuous-time Model-based Reinforcement Learning
Authors: Lenart Treven, Jonas Hübotter, Bhavya Sukhija, Florian Dörfler, Andreas Krause
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the benefits of continuous-time modeling over its discrete-time counterpart, as well as our proposed adaptive MSS over standard baselines, on several applications. [Section 5, Experiments] We now empirically evaluate the performance of OCORL on several environments. We test OCORL on Cancer Treatment and Glucose in blood systems from Howe et al. (2022), Pendulum, Mountain Car and Cart Pole from Brockman et al. (2016), Bicycle from Polack et al. (2017), Furuta Pendulum from Lutter et al. (2021) and Quadrotor in 2D and 3D from Nonami et al. (2010). The details of the system dynamics and tasks are provided in Appendix C. Comparison methods: To make the comparison fair, we adjust methods so that they all collect the same number of measurements per episode. For the equidistant setting, we collect M points per episode (we provide values of M for different systems in Appendix C). For the adaptive MSS, instead of one measurement per episode we collect a batch of M measurements such that they (as a batch) maximize the variance on the hallucinated trajectory. To this end, we consider the Greedy Max Determinant and Greedy Max Kernel Distance strategies of Holzmüller et al. (2022). We provide details of the adaptive strategies in Appendix C. We compare OCORL with the described MSSs to the optimal discrete-time zero-order hold control, where we assume access to the true discretized dynamics f_d(x, u) = x + ∫_0^{T/(M-1)} f(x(t), u) dt. We further also compare with the best continuous-time control policy, i.e., the solution of Equation (1). Does the continuous-time control policy perform better than the discrete-time control policy? In the first experiment, we test whether learning a continuous-time model from finite data, coupled with a continuous-time control policy on the learned model, can outperform the discrete-time zero-order hold control on the true system. We conduct the experiment on all environments and report the cost after running OCORL for a few tens of episodes (the exact experimental details are provided in Appendix C). From Table 1, we conclude that OCORL outperforms the discrete-time zero-order hold control on the true model on every system if we use the adaptive MSS, while achieving lower cost on 7 out of 9 systems if we measure the system equidistantly. Table 1: OCORL with adaptive MSSs achieves lower final cost C(π_N, f) compared to the discrete-time control on the true system on all tested environments while converging towards the best continuous-time control policy. While the equidistant MSS achieves higher cost compared to the adaptive MSS, it still outperforms the discrete-time zero-order hold control on the true model for most systems. (A hedged code sketch of the zero-order-hold discretization appears after the table.) |
| Researcher Affiliation | Academia | Lenart Treven ETH Zürich trevenl@ethz.ch Jonas Hübotter ETH Zürich jhuebotter@student.ethz.ch Bhavya Sukhija ETH Zürich sukhijab@ethz.ch Florian Dörfler ETH Zürich dorfler@ethz.ch Andreas Krause ETH Zürich krausea@ethz.ch |
| Pseudocode | Yes | Algorithm 1 GREEDY MAX DETERMINANT ... Algorithm 2 GREEDY MAX KERNEL DISTANCE (a hedged sketch of a greedy max-determinant batch selection appears after the table) |
| Open Source Code | Yes | Finally, we provide an efficient implementation of OCORL in JAX (Bradbury et al., 2018). https://github.com/lenarttreven/ocorl |
| Open Datasets | Yes | We test OCORL on Cancer Treatment and Glucose in blood systems from Howe et al. (2022), Pendulum, Mountain Car and Cart Pole from Brockman et al. (2016), Bicycle from Polack et al. (2017), Furuta Pendulum from Lutter et al. (2021) and Quadrotor in 2D and 3D from Nonami et al. (2010). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits or refer to a validation set. It describes evaluation on continuous environments/systems. |
| Hardware Specification | No | The paper does not specify any hardware used for running its experiments, such as CPU or GPU models. |
| Software Dependencies | No | The paper mentions 'JAX (Bradbury et al., 2018)' as the framework for implementation but does not provide a specific version number for JAX or any other software dependencies. |
| Experiment Setup | Yes | Table 4: We take a few tens of measurements per episode in each environment. We run the experiments for at most 40 episodes in each environment. ... Table 4: Episodes Hyperparameters ... Table 5: The MPC horizon for considered systems. |
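
The quoted comparison baseline assumes access to the true discretized dynamics f_d(x, u) = x + ∫_0^{T/(M-1)} f(x(t), u) dt, i.e. a zero-order hold of the control over one discretization interval. Below is a minimal sketch of that discretization, not the authors' code: the pendulum dynamics, the step size, and the number of RK4 sub-steps are illustrative assumptions, and the integrator is a generic fixed-step RK4 written with `jax.lax.scan`, in the spirit of the paper's JAX implementation.

```python
# Minimal sketch (not the authors' code) of zero-order-hold discretization:
# f_d(x, u) = x + \int_0^{dt} f(x(t), u) dt, with the control u held constant.
# The pendulum dynamics, dt, and number of RK4 sub-steps are illustrative.
import jax
import jax.numpy as jnp


def pendulum_dynamics(x, u):
    """Illustrative continuous-time dynamics x_dot = f(x, u)."""
    theta, omega = x
    return jnp.array([omega, -9.81 * jnp.sin(theta) + u[0]])


def zoh_discretize(f, x, u, dt, num_steps=10):
    """Approximate f_d(x, u) = x + int_0^dt f(x(t), u) dt with fixed-step RK4."""
    h = dt / num_steps

    def rk4_step(x, _):
        k1 = f(x, u)
        k2 = f(x + 0.5 * h * k1, u)
        k3 = f(x + 0.5 * h * k2, u)
        k4 = f(x + h * k3, u)
        return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4), None

    x_next, _ = jax.lax.scan(rk4_step, x, xs=None, length=num_steps)
    return x_next


x0 = jnp.array([jnp.pi / 4, 0.0])   # initial angle and angular velocity
u0 = jnp.array([0.0])               # zero torque, held over the interval
print(zoh_discretize(pendulum_dynamics, x0, u0, dt=0.05))
```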
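The following is a hedged sketch of a greedy max-determinant batch selection in the spirit of Holzmüller et al. (2022), not a reproduction of the paper's Algorithm 1: the RBF kernel and the 1-D candidate features are illustrative assumptions. It repeatedly picks the candidate with the largest posterior variance given the points already selected, which greedily maximizes the determinant of the selected kernel matrix and mirrors the "maximize the variance on the hallucinated trajectory" criterion quoted above.

```python
# Hedged sketch (an assumption, not the paper's Algorithm 1) of greedy
# max-determinant batch selection in the spirit of Holzmüller et al. (2022).
# Repeatedly pick the candidate with the largest posterior variance given the
# points selected so far; this greedily maximizes the determinant of the
# selected kernel matrix. Kernel and candidate features are illustrative.
import jax.numpy as jnp


def rbf_kernel(A, B, lengthscale=1.0):
    sq_dists = jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq_dists / lengthscale**2)


def greedy_max_determinant(candidates, batch_size, noise=1e-6):
    K = rbf_kernel(candidates, candidates)
    prior_var = jnp.diag(K) + noise
    variances = prior_var
    selected = []
    for _ in range(batch_size):
        idx = int(jnp.argmax(variances))
        selected.append(idx)
        sel = jnp.array(selected)
        K_sel = K[jnp.ix_(sel, sel)] + noise * jnp.eye(len(selected))
        K_cross = K[:, sel]                          # (n, |selected|)
        sol = jnp.linalg.solve(K_sel, K_cross.T)     # (|selected|, n)
        variances = prior_var - jnp.sum(K_cross.T * sol, axis=0)
        variances = variances.at[sel].set(-jnp.inf)  # never re-pick a point
    return selected


# Example: choose a batch of 5 measurement points out of 100 candidates
# (here 1-D "features", e.g. times along a hallucinated trajectory).
candidates = jnp.linspace(0.0, 1.0, 100)[:, None]
print(greedy_max_determinant(candidates, batch_size=5))
```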