Bootstrapped Transformer for Offline Reinforcement Learning
Authors: Kerong Wang, Hanye Zhao, Xufang Luo, Kan Ren, Weinan Zhang, Dongsheng Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. |
| Researcher Affiliation | Collaboration | Kerong Wang (Shanghai Jiao Tong University); Hanye Zhao (Shanghai Jiao Tong University); Xufang Luo (Microsoft Research Asia); Kan Ren (Microsoft Research Asia); Weinan Zhang (Shanghai Jiao Tong University); Dongsheng Li (Microsoft Research Asia) |
| Pseudocode | Yes | Algorithm 1 Training Procedure of BooT |
| Open Source Code | Yes | The codes and supplementary materials are available at https://seqml.github.io/bootorl. |
| Open Datasets | Yes | We evaluate our BooT algorithm on the dataset of continuous control tasks from the D4RL offline dataset [9]. |
| Dataset Splits | No | The paper mentions using the D4RL dataset and random seeds for training and evaluation, but does not explicitly specify the training/validation/test dataset splits (e.g., percentages or sample counts for each split). |
| Hardware Specification | Yes | All the experiments are run on a server with Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, and NVIDIA V100 GPU (32GB memory) with 8 cores. |
| Software Dependencies | No | The paper mentions using AdamW optimizer, but it does not specify software dependencies with version numbers (e.g., Python version, PyTorch version, CUDA version, or specific library versions). |
| Experiment Setup | Yes | We use AdamW optimizer with a learning rate of 6e-4 for TT training. The learning rate warms up for 10% of the total training steps, and linearly decays to 0. We train the model for 100 epochs with batch size 64 for Gym domain and 32 for Adroit domain. For discretization, we discretize states and actions into 1000 bins uniformly. |
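
The quoted experiment setup (AdamW at a 6e-4 peak learning rate, linear warmup over 10% of training steps followed by linear decay to zero, and uniform 1000-bin discretization of states and actions) is concrete enough to sketch in code. The snippet below is a minimal PyTorch/NumPy illustration of that recipe, not the authors' released implementation; the function names `make_optimizer_and_scheduler` and `discretize`, and the per-dimension `low`/`high` bounds, are assumptions introduced for the example.

```python
import numpy as np
import torch

# Values quoted from the paper's experiment setup.
LEARNING_RATE = 6e-4
NUM_EPOCHS = 100
BATCH_SIZE = 64         # Gym domain; 32 for the Adroit domain
WARMUP_FRACTION = 0.10  # warm up for 10% of total training steps
NUM_BINS = 1000         # uniform bins for states and actions

def make_optimizer_and_scheduler(model, total_steps):
    """AdamW with linear warmup over 10% of steps, then linear decay to 0."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                       # linear warmup
        decay_steps = max(1, total_steps - warmup_steps)
        return max(0.0, (total_steps - step) / decay_steps)  # linear decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

def discretize(values, low, high, num_bins=NUM_BINS):
    """Map continuous values in [low, high] to integer bin indices 0..num_bins-1."""
    values = np.clip(values, low, high)
    fraction = (values - low) / (high - low)
    return np.minimum((fraction * num_bins).astype(np.int64), num_bins - 1)
```

With a dataset of N trajectories, `total_steps` would be roughly `NUM_EPOCHS * (N // BATCH_SIZE)`, with `scheduler.step()` called once per optimizer update; clamping the top bin index in `discretize` keeps values exactly at `high` inside the valid token range.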