Representation Balancing Offline Model-based Reinforcement Learning

Authors: Byung-Jun Lee, Jongmin Lee, Kee-Eung Kim

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that the model trained by the RepB-SDE objective is robust to the distribution shift for the OPE task, particularly when the difference between the target and the behavior policies is large. We also introduce a model-based offline RL algorithm based on the RepB-SDE framework and report its performance on the D4RL benchmark (Fu et al., 2020), showing state-of-the-art performance in a representative set of tasks.
Researcher Affiliation | Collaboration | Byung-Jun Lee (1,3), Jongmin Lee (1) & Kee-Eung Kim (1,2); (1) School of Computing, KAIST, Daejeon, Republic of Korea; (2) Graduate School of AI, KAIST, Daejeon, Republic of Korea; (3) Gauss Labs Inc., Seoul, Republic of Korea
Pseudocode | Yes | We present the pseudo-code of the Representation Balancing Offline Model-based RL algorithm below (Algorithm 1: Representation Balancing Offline Model-based RL). A schematic sketch of this training loop is given after the table.
Open Source Code | Yes | The code used to produce the results is available online: https://github.com/dlqudwns/repb-sde
Open Datasets | Yes | The paper reports performance on the D4RL benchmark (Fu et al., 2020): We evaluate the offline model-based RL algorithm presented in Section 4.2 on a subset of datasets in the D4RL benchmark (Fu et al., 2020), using four types of datasets (Random, Medium, Medium-Replay, and Medium-Expert) from three different MuJoCo environments (HalfCheetah-v2, Hopper-v2, and Walker2d-v2) (Todorov et al., 2012). A minimal loading example is given after the table.
Dataset Splits | Yes | Across all domains, we train an ensemble of 7 models and pick the best 5 models by their validation error on a hold-out set of 1000 transitions from the dataset. A sketch of this selection step is given after the table.
Hardware Specification | Yes | All experiments were conducted on the Google Cloud Platform. Specifically, we used compute-optimized machines (c2-standard-4) that provide 4 vCPUs and 16 GB of memory for the evaluation experiment of Section 5.1, and high-memory machines (n1-highmem-4) that provide 4 vCPUs and 26 GB of memory, equipped with an Nvidia Tesla K80 GPU, for the RL experiment of Section 5.2.
Software Dependencies | No | The paper mentions using Adam for optimization and SAC for policy optimization, but does not provide specific version numbers for these components or for other software such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We standardized the inputs and outputs of the neural network and used Adam (Kingma & Ba, 2014) with a learning rate of 3 × 10^-4 for the optimization. Common hyperparameters shared among algorithms are shown in Table 3: learning rate 3 × 10^-4, discount factor γ = 0.99, number of samples per minibatch 256, target smoothing coefficient τ = 5 × 10^-3, [actor/critic] number of hidden layers 2, [actor/critic] number of hidden units per layer 256, [actor/critic] non-linearity ReLU, number of rollouts 10^4, max length of rollouts 10^3, rollout buffer size 5 × 10^7. These values are collected into a config sketch after the table.
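
The Algorithm 1 referenced in the Pseudocode row follows the usual Dyna-style offline model-based RL pattern: fit a dynamics model on the fixed dataset, generate short synthetic rollouts with it, and train an off-policy agent on the mixture of real and model data. The Python sketch below illustrates only this outer loop; every helper name in it (train_dynamics_ensemble, init_sac_agent, sample_model_rollouts, sample_batches, ReplayBuffer) is a hypothetical placeholder, and the actual RepB-SDE objective and rollout scheme are those of the paper and its repository, not this sketch.

# Hedged sketch of the Dyna-style loop that Algorithm 1 describes.
# All helper names are hypothetical placeholders, not the authors' API.
def offline_model_based_rl(offline_dataset, num_epochs,
                           rollouts_per_epoch, max_rollout_length):
    # 1) Fit an ensemble of dynamics models on the fixed offline dataset
    #    (standing in for the paper's representation-balanced objective).
    model_ensemble = train_dynamics_ensemble(offline_dataset)

    agent = init_sac_agent()
    rollout_buffer = ReplayBuffer()
    for epoch in range(num_epochs):
        # 2) Branch short synthetic rollouts from dataset states using the
        #    learned model and the current policy.
        rollouts = sample_model_rollouts(model_ensemble, agent.policy,
                                         offline_dataset,
                                         n=rollouts_per_epoch,
                                         max_len=max_rollout_length)
        rollout_buffer.add(rollouts)

        # 3) Update the SAC agent on batches drawn from the offline data
        #    and the model-generated rollouts.
        for batch in sample_batches(offline_dataset, rollout_buffer):
            agent.update(batch)
    return agent.policy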
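
For the Open Datasets row, the snippet below shows the standard way to load a D4RL dataset via the d4rl package; the environment ID is illustrative and may differ from the exact dataset versions the authors used.

# Minimal example of loading one of the D4RL datasets named above.
import gym
import d4rl  # importing d4rl registers the offline datasets with gym

env = gym.make('halfcheetah-medium-v0')
data = d4rl.qlearning_dataset(env)  # dict of NumPy arrays

print(data['observations'].shape,   # (N, obs_dim)
      data['actions'].shape,        # (N, act_dim)
      data['rewards'].shape,        # (N,)
      data['terminals'].shape)      # (N,)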
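
The Dataset Splits row describes training 7 models and keeping the 5 with the lowest error on a 1000-transition hold-out set. The runnable toy example below mirrors only that selection logic; the linear least-squares models and the synthetic data are stand-ins, not the paper's neural network dynamics models.

# Toy illustration of the selection step: train 7 models, hold out 1000
# transitions, keep the 5 models with the lowest validation error.
import numpy as np

rng = np.random.default_rng(0)
N, obs_dim, act_dim = 11000, 17, 6           # sizes are illustrative
X = rng.normal(size=(N, obs_dim + act_dim))  # (state, action) inputs
Y = rng.normal(size=(N, obs_dim))            # next-state targets

perm = rng.permutation(N)
val_idx, train_idx = perm[:1000], perm[1000:]  # 1000-transition hold-out

n_models, n_keep = 7, 5
models, val_errors = [], []
for seed in range(n_models):
    # Each ensemble member is fit on a different bootstrap sample.
    boot = rng.choice(train_idx, size=len(train_idx), replace=True)
    W, *_ = np.linalg.lstsq(X[boot], Y[boot], rcond=None)
    models.append(W)
    val_errors.append(np.mean((X[val_idx] @ W - Y[val_idx]) ** 2))

best = np.argsort(val_errors)[:n_keep]  # lowest validation MSE first
ensemble = [models[i] for i in best]
print("kept models:", best)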
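
Finally, the shared hyperparameters quoted in the Experiment Setup row can be collected into a single configuration dictionary; the key names below are our own shorthand, and the numerical values are transcribed from the paper's Table 3 as quoted above.

# Shared hyperparameters from Table 3, gathered into one config dict.
config = {
    "learning_rate": 3e-4,
    "discount_factor_gamma": 0.99,
    "minibatch_size": 256,
    "target_smoothing_tau": 5e-3,
    "actor_critic_hidden_layers": 2,
    "actor_critic_hidden_units_per_layer": 256,
    "actor_critic_nonlinearity": "ReLU",
    "num_rollouts": 10**4,
    "max_rollout_length": 10**3,
    "rollout_buffer_size": 5 * 10**7,
}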