Policy Expansion for Bridging Offline-to-Online Reinforcement Learning
Authors: Haichao Zhang, Wei Xu, Haonan Yu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach. Code is available: https://github.com/Haichao-Zhang/PEX. |
| Researcher Affiliation | Industry | Haichao Zhang, Wei Xu, Haonan Yu; Horizon Robotics, Cupertino, CA 95014; {haichao.zhang, wei.xu, haonan.yu}@horizon.ai |
| Pseudocode | Yes | Algorithm 1 PEX: Policy Expansion for Offline-to-Online RL (a minimal sketch of the expansion mechanism is given after the table). |
| Open Source Code | Yes | Code is available: https://github.com/Haichao-Zhang/PEX. |
| Open Datasets | Yes | We use the standard D4RL benchmark which has been widely used in offline RL community (Fu et al., 2020). For offline learning, we use the provided dataset for training. For online learning, we use the accompanied simulator for interaction and training. |
| Dataset Splits | Yes | We use the standard D4RL benchmark which has been widely used in offline RL community (Fu et al., 2020). D4RL is a well-established benchmark with predefined dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only refers to the experimental setup and results. |
| Software Dependencies | No | The paper mentions using IQL (Kostrikov et al., 2022) and SAC (Haarnoja et al., 2018) as backbone algorithms, and refers to 'the code released by the authors of (Lee et al., 2021)' for one baseline, but it does not specify explicit version numbers for any software, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Hyper-parameters and values: number of parallel envs = 1; discount = 0.99; replay buffer size = 1e6; batch size = 256; MLP hidden layer sizes = [256, 256]; learning rate = 3e-4; initial collection steps = 5000; target update speed = 5e-3; expectile value τ = 0.9 (0.7); inverse temperature α⁻¹ = 10 (3); number of offline iterations = 1M; number of online iterations = 1M; number of iterations per rollout step = 1; target entropy (SAC) d. These settings are collected in the configuration sketch below. |
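The Pseudocode row refers to Algorithm 1 (PEX). Below is a minimal sketch of the policy-expansion action selection that algorithm describes, assuming PyTorch-style policy and critic callables; `offline_policy`, `online_policy`, `critic`, and `temperature` are illustrative placeholders rather than the released implementation at https://github.com/Haichao-Zhang/PEX.

```python
import torch
import torch.nn.functional as F


def pex_select_action(state, offline_policy, online_policy, critic, temperature=1.0):
    """Sketch of PEX-style action selection (not the authors' released code).

    The frozen policy learned offline and the newly initialized online policy
    each propose an action; one proposal is picked by sampling from a
    Boltzmann distribution over their critic values.
    """
    with torch.no_grad():
        candidates = torch.stack([
            offline_policy(state),   # policy learned offline, kept frozen
            online_policy(state),    # new policy trained during online phase
        ])                           # shape: [num_policies, action_dim]
        q_values = torch.stack([critic(state, a) for a in candidates])  # [num_policies]
        probs = F.softmax(q_values / temperature, dim=0)
        idx = torch.multinomial(probs, num_samples=1).item()
    return candidates[idx]
```

Under this reading, the offline policy stays frozen and only the newly added policy is updated online, with the actual update rules coming from the backbone algorithm (IQL) used in the paper.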
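The hyper-parameters quoted in the Experiment Setup row can be gathered into a single configuration. This is only a transcription sketch: the dictionary keys are invented here for readability, and the parenthesised alternatives are kept as comments because the extract does not state which task domain each value applies to.

```python
# Sketch collecting the hyper-parameters quoted in the table above.
pex_hparams = {
    "num_parallel_envs": 1,
    "discount": 0.99,
    "replay_buffer_size": int(1e6),
    "batch_size": 256,
    "mlp_hidden_sizes": [256, 256],
    "learning_rate": 3e-4,
    "initial_collection_steps": 5000,
    "target_update_speed": 5e-3,        # soft target-network update rate
    "expectile_tau": 0.9,               # alternative setting: 0.7
    "inverse_temperature": 10,          # alternative setting: 3
    "num_offline_iterations": 1_000_000,
    "num_online_iterations": 1_000_000,
    "iterations_per_rollout_step": 1,
    # target entropy (SAC): value truncated in the extracted text
}
```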