Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics

Authors: Matthias Weissenbacher, Samarth Sinha, Animesh Garg, Yoshinobu Kawahara

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Moreover, we empirically evaluate our method on several benchmark offline reinforcement learning tasks and datasets including D4RL, Metaworld and Robosuite and find that by using our framework we consistently improve the state-of-the-art of model-free Q-learning methods.
Researcher Affiliation | Academia | 1 RIKEN Center for Advanced Intelligence Project, Japan; 2 Vector Institute, University of Toronto, Canada; 3 Institute of Mathematics for Industry, Kyushu University, Japan
Pseudocode | No | The paper describes the steps of the 'KFC' algorithm in numbered points within a section, but these steps are presented in prose rather than in a structured pseudocode block or algorithm box.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We will first experiment with the popular D4RL benchmark commonly used for offline RL (Fu et al., 2021). The benchmark covers various different tasks such as locomotion tasks with Mujoco Gym (Brockman et al., 2016), tasks that require hierarchical planning such as antmaze, and other robotics tasks such as kitchen and adroit (Rajeswaran et al., 2017). Furthermore, similar to S4RL (Sinha et al., 2021), we perform experiments on 6 different challenging robotics tasks from Meta-World (Yu et al., 2019) and Robosuite (Zhu et al., 2020).
Dataset Splits | Yes | For training the Koopman forward model in Eq. (17) we split the dataset in a randomly selected training and validation set with ratios 70%/30%. (A minimal loading-and-split sketch follows the table.)
Hardware Specification | Yes | The hardware was as follows: NVIDIA DGX-2 with 16 V100 GPUs and 96 cores of Intel(R) Xeon(R) Platinum 8168 CPUs, and NVIDIA DGX-1 with 8 A100 GPUs with 80 cores of Intel(R) Xeon(R) E5-2698 v4 CPUs. The models are trained on a single V100 or A100 GPU.
Software Dependencies | Yes | We performed the empirical experiments on a system with PyTorch 1.9.0a (Paszke et al., 2019).
Experiment Setup | Yes | We take over the hyperparameter settings from the CQL paper (Kumar et al., 2020) except for the fact that we do not use automatic entropy tuning of the policy optimisation step Eq. (3) but instead a fixed value of α = 0.2. The remaining hyper-parameters of the conservative Q-learning algorithm are as follows: γ = 0.99 and τ = 5 × 10⁻³ for the discount factor and target smoothing coefficient, respectively. Moreover, the policy learning rate is 1 × 10⁻⁴ and the value function learning rate is 3 × 10⁻⁴ for the ADAM optimizers. [...] The batch size is chosen to be 256. Moreover, the algorithm performs behavioral cloning for the first 40k training-steps, i.e. time-steps in an online RL notation. (A configuration sketch follows the table.)
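
As a reading aid for the Open Datasets and Dataset Splits rows, below is a minimal sketch of loading a D4RL task and performing the 70%/30% random train/validation split described in the quoted text. Only the use of D4RL and the split ratio come from the paper; the task id, seed, and variable names are illustrative assumptions.

```python
import gym
import d4rl  # importing d4rl registers the offline datasets with gym
import numpy as np

# Illustrative task id; any D4RL environment name could be used here (assumption).
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, next_observations, rewards, terminals

# Random 70%/30% train/validation split, as stated for the Koopman forward model.
n = dataset["observations"].shape[0]
rng = np.random.default_rng(0)  # assumed seed, not specified in the paper
perm = rng.permutation(n)
split = int(0.7 * n)
train_idx, val_idx = perm[:split], perm[split:]

train_set = {k: v[train_idx] for k, v in dataset.items()}
val_set = {k: v[val_idx] for k, v in dataset.items()}
```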
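
Similarly, a sketch that collects the hyperparameters quoted in the Experiment Setup row into one configuration dictionary. The key names are assumptions; only the values are taken from the quoted text.

```python
config = dict(
    alpha=0.2,              # fixed entropy coefficient (no automatic tuning)
    discount=0.99,          # gamma
    target_smoothing=5e-3,  # tau, target-network smoothing coefficient
    policy_lr=1e-4,         # Adam learning rate for the policy
    value_lr=3e-4,          # Adam learning rate for the value function
    batch_size=256,
    bc_steps=40_000,        # behavioral cloning for the first 40k training steps
)
```

Optimizers would then be instantiated per network, e.g. `torch.optim.Adam(policy.parameters(), lr=config["policy_lr"])` for the policy, with the analogous call for the Q-functions.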