Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics
Authors: Matthias Weissenbacher, Samarth Sinha, Animesh Garg, Yoshinobu Kawahara
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, we empirically evaluate our method on several benchmark offline reinforcement learning tasks and datasets including D4RL, Metaworld and Robosuite and find that by using our framework we consistently improve the state-of-the-art of model-free Q-learning methods. |
| Researcher Affiliation | Academia | 1 RIKEN Center for Advanced Intelligence Project, Japan; 2 Vector Institute, University of Toronto, Canada; 3 Institute of Mathematics for Industry, Kyushu University, Japan |
| Pseudocode | No | The paper describes the steps of the 'KFC-algorithm' in numbered points within a section, but these steps are presented in prose rather than a structured pseudocode block or algorithm box. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We will first experiment with the popular D4RL benchmark commonly used for offline RL (Fu et al., 2021). The benchmark covers various different tasks such as locomotion tasks with MuJoCo Gym (Brockman et al., 2016), tasks that require hierarchical planning such as antmaze, and other robotics tasks such as kitchen and adroit (Rajeswaran et al., 2017). Furthermore, similar to S4RL (Sinha et al., 2021), we perform experiments on 6 different challenging robotics tasks from Meta-World (Yu et al., 2019) and Robosuite (Zhu et al., 2020). |
| Dataset Splits | Yes | For training the Koopman forward model in Eq. (17) we split the dataset in a randomly selected training and validation set with ratios 70%/30%. |
| Hardware Specification | Yes | The hardware was as follows: NVIDIA DGX-2 with 16 V100 GPUs and 96 cores of Intel(R) Xeon(R) Platinum 8168 CPUs and NVIDIA DGX-1 with 8 A100 GPUs with 80 cores of Intel(R) Xeon(R) E5-2698 v4 CPUs. The models are trained on a single V100 or A100 GPU. |
| Software Dependencies | Yes | We performed the empirical experiments on a system with PyTorch 1.9.0a (Paszke et al., 2019) |
| Experiment Setup | Yes | We take over the hyperparameter settings from the CQL paper (Kumar et al., 2020) except for the fact that we do not use automatic entropy tuning of the policy optimisation step Eq. (3) but instead a fixed value of α = 0.2. The remaining hyper-parameters of the conservative Q-learning algorithm are as follows: γ = 0.99 and τ = 5·10⁻³ for the discount factor and target smoothing coefficient, respectively. Moreover, the policy learning rate is 1·10⁻⁴ and the value function learning rate is 3·10⁻⁴ for the ADAM optimizers. [...] The batch size is chosen to be 256. Moreover, the algorithm performs behavioral cloning for the first 40k training-steps, i.e. time-steps in an online RL notation. |
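
For reference alongside the Open Datasets row, the sketch below shows how one of the cited D4RL locomotion datasets is typically loaded with the public `d4rl` package. This is an illustrative example under standard-library assumptions, not code from the paper; the environment name `halfcheetah-medium-v2` is simply one of the benchmark's Gym-MuJoCo tasks.

```python
# Illustrative only: loading a D4RL dataset with the public `d4rl` package
# (assumes a MuJoCo-enabled Gym installation; not the authors' code).
import gym
import d4rl  # importing registers the offline-RL environments with Gym

env = gym.make("halfcheetah-medium-v2")   # one of the D4RL Gym-MuJoCo tasks
dataset = d4rl.qlearning_dataset(env)     # dict of transition arrays

# Arrays for (s, a, r, s', done), ready for offline Q-learning.
for key in ("observations", "actions", "rewards", "next_observations", "terminals"):
    print(key, dataset[key].shape)
```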
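
The 70%/30% train/validation split quoted in the Dataset Splits row can be reproduced in PyTorch roughly as follows. This is a minimal sketch that continues from the loading example above and wraps the transitions in a `TensorDataset`; the variable names, fixed seed, and batch size are illustrative assumptions, not values taken from the paper's (unreleased) code.

```python
# Illustrative 70%/30% random train/validation split for fitting the Koopman
# forward model (variable names and seed are assumptions, not from the paper).
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

transitions = TensorDataset(
    torch.as_tensor(dataset["observations"], dtype=torch.float32),
    torch.as_tensor(dataset["actions"], dtype=torch.float32),
    torch.as_tensor(dataset["next_observations"], dtype=torch.float32),
)

n_train = int(0.7 * len(transitions))              # 70% for training
train_set, val_set = random_split(
    transitions,
    [n_train, len(transitions) - n_train],         # remaining 30% for validation
    generator=torch.Generator().manual_seed(0),    # fixed seed for a reproducible split
)

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)
```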
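
Finally, the hyperparameters quoted in the Experiment Setup row are gathered below into a single configuration dictionary for readability. The values come directly from the quoted text; the dictionary and key names are hypothetical, since the paper does not publish a configuration file.

```python
# Hyperparameters as quoted in the Experiment Setup row (key names are illustrative).
cql_config = {
    "entropy_alpha": 0.2,       # fixed SAC temperature (no automatic entropy tuning)
    "discount_gamma": 0.99,     # discount factor
    "target_tau": 5e-3,         # target-network smoothing coefficient
    "policy_lr": 1e-4,          # Adam learning rate for the policy
    "value_lr": 3e-4,           # Adam learning rate for the value function
    "batch_size": 256,
    "bc_warmup_steps": 40_000,  # behavioral cloning for the first 40k training steps
}
```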