Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics
Authors: Matthias Weissenbacher, Samarth Sinha, Animesh Garg, Yoshinobu Kawahara
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Moreover, we empirically evaluate our method on several benchmark offline reinforcement learning tasks and datasets including D4RL, Metaworld and Robosuite and find that by using our framework we consistently improve the state-of-the-art of model-free Q-learning methods. |
| Researcher Affiliation | Academia | 1 RIKEN Center for Advanced Intelligence Project, Japan; 2 Vector Institute, University of Toronto, Canada; 3 Institute of Mathematics for Industry, Kyushu University, Japan |
| Pseudocode | No | The paper describes the steps of the 'KFC-algorithm' in numbered points within a section, but these steps are presented in prose rather than a structured pseudocode block or algorithm box. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We will first experiment with the popular D4RL benchmark commonly used for offline RL (Fu et al., 2021). The benchmark covers various different tasks such as locomotion tasks with MuJoCo Gym (Brockman et al., 2016), tasks that require hierarchical planning such as antmaze, and other robotics tasks such as kitchen and adroit (Rajeswaran et al., 2017). Furthermore, similar to S4RL (Sinha et al., 2021), we perform experiments on 6 different challenging robotics tasks from Meta-World (Yu et al., 2019) and Robosuite (Zhu et al., 2020). |
| Dataset Splits | Yes | For training the Koopman forward model in Eq. (17) we split the dataset in a randomly selected training and validation set with ratios 70%/30%. |
| Hardware Specification | Yes | The hardware was as follows: NVIDIA DGX-2 with 16 V100 GPUs and 96 cores of Intel(R) Xeon(R) Platinum 8168 CPUs and NVIDIA DGX-1 with 8 A100 GPUs with 80 cores of Intel(R) Xeon(R) E5-2698 v4 CPUs. The models are trained on a single V100 or A100 GPU. |
| Software Dependencies | Yes | We performed the empirical experiments on a system with PyTorch 1.9.0a (Paszke et al., 2019) |
| Experiment Setup | Yes | We take over the hyperparameter settings from the CQL paper (Kumar et al., 2020) except for the fact that we do not use automatic entropy tuning of the policy optimisation step Eq. (3) but instead a fixed value of α = 0.2. The remaining hyper-parameters of the conservative Q-learning algorithm are as follows: γ = 0.99 and τ = 5·10⁻³ for the discount factor and target smoothing coefficient, respectively. Moreover, the policy learning rate is 1·10⁻⁴ and the value function learning rate is 3·10⁻⁴ for the ADAM optimizers. [...] The batch size is chosen to be 256. Moreover, the algorithm performs behavioral cloning for the first 40k training-steps, i.e. time-steps in an online RL notation. |
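
For reference alongside the Open Datasets row, the sketch below shows how one of the cited D4RL locomotion datasets is typically loaded with the public `d4rl` package. This is an illustrative example under standard-library assumptions, not code from the paper; the environment name `halfcheetah-medium-v2` is simply one of the benchmark's Gym-MuJoCo tasks.

```python
# Illustrative only: loading a D4RL dataset with the public `d4rl` package
# (assumes a MuJoCo-enabled Gym installation; not the authors' code).
import gym
import d4rl  # importing registers the offline-RL environments with Gym

env = gym.make("halfcheetah-medium-v2")   # one of the D4RL Gym-MuJoCo tasks
dataset = d4rl.qlearning_dataset(env)     # dict of transition arrays

# Arrays for (s, a, r, s', done), ready for offline Q-learning.
for key in ("observations", "actions", "rewards", "next_observations", "terminals"):
    print(key, dataset[key].shape)
```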
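
The 70%/30% train/validation split quoted in the Dataset Splits row can be reproduced in PyTorch roughly as follows. This is a minimal sketch that continues from the loading example above and wraps the transitions in a `TensorDataset`; the variable names, fixed seed, and batch size are illustrative assumptions, not values taken from the paper's (unreleased) code.

```python
# Illustrative 70%/30% random train/validation split for fitting the Koopman
# forward model (variable names and seed are assumptions, not from the paper).
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

transitions = TensorDataset(
    torch.as_tensor(dataset["observations"], dtype=torch.float32),
    torch.as_tensor(dataset["actions"], dtype=torch.float32),
    torch.as_tensor(dataset["next_observations"], dtype=torch.float32),
)

n_train = int(0.7 * len(transitions))              # 70% for training
train_set, val_set = random_split(
    transitions,
    [n_train, len(transitions) - n_train],         # remaining 30% for validation
    generator=torch.Generator().manual_seed(0),    # fixed seed for a reproducible split
)

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)
```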
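
Finally, the hyperparameters quoted in the Experiment Setup row are gathered below into a single configuration dictionary for readability. The values come directly from the quoted text; the dictionary and key names are hypothetical, since the paper does not publish a configuration file.

```python
# Hyperparameters as quoted in the Experiment Setup row (key names are illustrative).
cql_config = {
    "entropy_alpha": 0.2,       # fixed SAC temperature (no automatic entropy tuning)
    "discount_gamma": 0.99,     # discount factor
    "target_tau": 5e-3,         # target-network smoothing coefficient
    "policy_lr": 1e-4,          # Adam learning rate for the policy
    "value_lr": 3e-4,           # Adam learning rate for the value function
    "batch_size": 256,
    "bc_warmup_steps": 40_000,  # behavioral cloning for the first 40k training steps
}
```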