Representation Balancing Offline Model-based Reinforcement Learning

Authors: Byung-Jun Lee, Jongmin Lee, Kee-Eung Kim

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that the model trained by the RepB-SDE objective is robust to the distribution shift for the OPE task, particularly when the difference between the target and the behavior policies is large. We also introduce a model-based offline RL algorithm based on the RepB-SDE framework and report its performance on the D4RL benchmark (Fu et al., 2020), showing state-of-the-art performance in a representative set of tasks.
Researcher Affiliation | Collaboration | Byung-Jun Lee (1,3), Jongmin Lee (1) & Kee-Eung Kim (1,2); (1) School of Computing, KAIST, Daejeon, Republic of Korea; (2) Graduate School of AI, KAIST, Daejeon, Republic of Korea; (3) Gauss Labs Inc., Seoul, Republic of Korea
Pseudocode | Yes | We present the pseudo-code of the Representation Balancing Offline Model-based RL algorithm below (Algorithm 1: Representation Balancing Offline Model-based RL). A schematic sketch of this training loop is given after the table.
Open Source Code | Yes | The code used to produce the results is available online: https://github.com/dlqudwns/repb-sde
Open Datasets | Yes | The paper reports performance on the D4RL benchmark (Fu et al., 2020): We evaluate the offline model-based RL algorithm presented in Section 4.2 on a subset of datasets in the D4RL benchmark (Fu et al., 2020), using four types of datasets (Random, Medium, Medium-Replay, and Medium-Expert) from three different MuJoCo environments (HalfCheetah-v2, Hopper-v2, and Walker2d-v2) (Todorov et al., 2012). A minimal loading example is given after the table.
Dataset Splits | Yes | Across all domains, we train an ensemble of 7 models and pick the best 5 models by their validation error on a hold-out set of 1000 transitions from the dataset. A sketch of this selection step is given after the table.
Hardware Specification | Yes | All experiments were conducted on the Google Cloud Platform. Specifically, we used compute-optimized machines (c2-standard-4) that provide 4 vCPUs and 16 GB of memory for the evaluation experiment of Section 5.1, and high-memory machines (n1-highmem-4) that provide 4 vCPUs and 26 GB of memory, equipped with an Nvidia Tesla K80 GPU, for the RL experiment of Section 5.2.
Software Dependencies | No | The paper mentions using Adam for optimization and SAC for policy optimization, but does not provide specific version numbers for these components or for other software such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We standardized the inputs and outputs of the neural network and used Adam (Kingma & Ba, 2014) with a learning rate of 3 × 10^-4 for the optimization. Common hyperparameters shared among algorithms are shown in Table 3: learning rate 3 × 10^-4, discount factor γ = 0.99, number of samples per minibatch 256, target smoothing coefficient τ = 5 × 10^-3, [actor/critic] number of hidden layers 2, [actor/critic] number of hidden units per layer 256, [actor/critic] non-linearity ReLU, number of rollouts 10^4, max length of rollouts 10^3, rollout buffer size 5 × 10^7. These values are collected into a config sketch after the table.
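
The Algorithm 1 referenced in the Pseudocode row follows the usual Dyna-style offline model-based RL pattern: fit a dynamics model on the fixed dataset, generate short synthetic rollouts with it, and train an off-policy agent on the mixture of real and model data. The Python sketch below illustrates only this outer loop; every helper name in it (train_dynamics_ensemble, init_sac_agent, sample_model_rollouts, sample_batches, ReplayBuffer) is a hypothetical placeholder, and the actual RepB-SDE objective and rollout scheme are those of the paper and its repository, not this sketch.

# Hedged sketch of the Dyna-style loop that Algorithm 1 describes.
# All helper names are hypothetical placeholders, not the authors' API.
def offline_model_based_rl(offline_dataset, num_epochs,
                           rollouts_per_epoch, max_rollout_length):
    # 1) Fit an ensemble of dynamics models on the fixed offline dataset
    #    (standing in for the paper's representation-balanced objective).
    model_ensemble = train_dynamics_ensemble(offline_dataset)

    agent = init_sac_agent()
    rollout_buffer = ReplayBuffer()
    for epoch in range(num_epochs):
        # 2) Branch short synthetic rollouts from dataset states using the
        #    learned model and the current policy.
        rollouts = sample_model_rollouts(model_ensemble, agent.policy,
                                         offline_dataset,
                                         n=rollouts_per_epoch,
                                         max_len=max_rollout_length)
        rollout_buffer.add(rollouts)

        # 3) Update the SAC agent on batches drawn from the offline data
        #    and the model-generated rollouts.
        for batch in sample_batches(offline_dataset, rollout_buffer):
            agent.update(batch)
    return agent.policy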
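
For the Open Datasets row, the snippet below shows the standard way to load a D4RL dataset via the d4rl package; the environment ID is illustrative and may differ from the exact dataset versions the authors used.

# Minimal example of loading one of the D4RL datasets named above.
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

env = gym.make('halfcheetah-medium-v0')
data = d4rl.qlearning_dataset(env)  # dict of NumPy arrays

print(data['observations'].shape,   # (N, obs_dim)
      data['actions'].shape,        # (N, act_dim)
      data['rewards'].shape,        # (N,)
      data['terminals'].shape)      # (N,)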
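
The Dataset Splits row describes training 7 models and keeping the 5 with the lowest error on a 1000-transition hold-out set. The runnable toy example below mirrors only that selection logic; the linear least-squares models and the synthetic data are stand-ins, not the paper's neural network dynamics models.

# Toy illustration of the selection step: train 7 models, hold out 1000
# transitions, keep the 5 models with the lowest validation error.
import numpy as np

rng = np.random.default_rng(0)
N, obs_dim, act_dim = 11000, 17, 6           # sizes are illustrative
X = rng.normal(size=(N, obs_dim + act_dim))  # (state, action) inputs
Y = rng.normal(size=(N, obs_dim))            # next-state targets

perm = rng.permutation(N)
val_idx, train_idx = perm[:1000], perm[1000:]  # 1000-transition hold-out

n_models, n_keep = 7, 5
models, val_errors = [], []
for seed in range(n_models):
    # Each ensemble member is fit on a different bootstrap sample.
    boot = rng.choice(train_idx, size=len(train_idx), replace=True)
    W, *_ = np.linalg.lstsq(X[boot], Y[boot], rcond=None)
    models.append(W)
    val_errors.append(np.mean((X[val_idx] @ W - Y[val_idx]) ** 2))

best = np.argsort(val_errors)[:n_keep]  # lowest validation MSE first
ensemble = [models[i] for i in best]
print("kept models:", best)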
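
Finally, the shared hyperparameters quoted in the Experiment Setup row can be collected into a single configuration dictionary; the key names below are our own shorthand, and the numerical values are transcribed from the paper's Table 3 as quoted above.

# Shared hyperparameters from Table 3, gathered into one config dict.
config = {
    "learning_rate": 3e-4,
    "discount_factor_gamma": 0.99,
    "minibatch_size": 256,
    "target_smoothing_tau": 5e-3,
    "actor_critic_hidden_layers": 2,
    "actor_critic_hidden_units_per_layer": 256,
    "actor_critic_nonlinearity": "ReLU",
    "num_rollouts": 10**4,
    "max_rollout_length": 10**3,
    "rollout_buffer_size": 5 * 10**7,
}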