Policy Expansion for Bridging Offline-to-Online Reinforcement Learning

Authors: Haichao Zhang, Wei Xu, Haonan Yu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on a number of tasks, and the results demonstrate the effectiveness of the proposed approach. Code is available: https://github.com/Haichao-Zhang/PEX.
Researcher Affiliation | Industry | Haichao Zhang, Wei Xu, Haonan Yu; Horizon Robotics, Cupertino, CA 95014; {haichao.zhang, wei.xu, haonan.yu}@horizon.ai
Pseudocode | Yes | Algorithm 1 (PEX: Policy Expansion for Offline-to-Online RL) is provided. A hedged sketch of the algorithm's action-selection step is given after this table.
Open Source Code Yes Code is available: https://github.com/Haichao-Zhang/PEX.
Open Datasets | Yes | We use the standard D4RL benchmark, which has been widely used in the offline RL community (Fu et al., 2020). For offline learning, we use the provided dataset for training; for online learning, we use the accompanying simulator for interaction and training. (A generic dataset-loading snippet is shown after this table.)
Dataset Splits | Yes | We use the standard D4RL benchmark, which has been widely used in the offline RL community (Fu et al., 2020). D4RL is a well-established benchmark with predefined dataset splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments; it only refers to the experimental setup and results.
Software Dependencies | No | The paper mentions using IQL (Kostrikov et al., 2022) and SAC (Haarnoja et al., 2018) as backbone algorithms, and refers to "the code released by the authors of (Lee et al., 2021)" for one baseline, but it does not specify explicit version numbers for any software, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The paper reports the following hyper-parameters:
  number of parallel env: 1
  discount: 0.99
  replay buffer size: 1e6
  batch size: 256
  MLP hidden layer size: [256, 256]
  learning rate: 3e-4
  initial collection steps: 5000
  target update speed: 5e-3
  expectile value τ: 0.9 (0.7)
  inverse temperature α: 10 (3)
  number of offline iterations: 1M
  number of online iterations: 1M
  number of iterations per rollout step: 1
  target entropy (SAC): −dim(A)
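
To make the pseudocode entry concrete: PEX keeps the offline-trained policy frozen, adds a freshly initialized online policy to form an expanded policy set, and at each state selects among the candidate actions proposed by the two policies with a Boltzmann distribution over their Q-values. The following is a minimal sketch of that selection step, assuming hypothetical handles pi_offline, pi_online, and q_fn; it is not the repository's actual API.

```python
import torch

def pex_select_action(state, pi_offline, pi_online, q_fn, inv_temp=10.0):
    """Sketch of PEX action selection (Algorithm 1 in the paper).

    Each policy in the expanded set {pi_offline, pi_online} proposes one
    candidate action; one proposal is then sampled with probability
    proportional to exp(inv_temp * Q(s, a)) under the shared critic.
    All argument names here are illustrative assumptions.
    """
    # Candidate actions from the frozen offline policy and the learnable
    # online policy (assumed interface: policy(state) -> action tensor).
    a_off = pi_offline(state)
    a_on = pi_online(state)
    # Q-value of each candidate under the shared critic, shape [2].
    q_vals = torch.stack([q_fn(state, a_off), q_fn(state, a_on)])
    # Categorical over the two proposals, weighted by exponentiated Q-values.
    probs = torch.softmax(q_vals * inv_temp, dim=0)
    idx = torch.multinomial(probs, num_samples=1).item()
    return a_off if idx == 0 else a_on
```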
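For the dataset entries above, D4RL datasets are conventionally accessed through their accompanying Gym environments. The snippet below is a generic illustration of the offline/online protocol, not code from the PEX repository; it assumes the d4rl package and a pre-0.26 gym are installed.

```python
import gym
import d4rl  # registers the D4RL environments with gym

# Offline phase: train on the static dataset shipped with the benchmark.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...
print(dataset["observations"].shape)

# Online phase: interact with the accompanying simulator.
obs = env.reset()
action = env.action_space.sample()
obs, reward, done, info = env.step(action)
```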
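Finally, the experiment-setup entry maps naturally onto a flat configuration. The dict below merely restates the reported values in code form with hypothetical key names; the PEX repository may organize its configuration differently. Parenthesized alternates from the table are noted in comments.

```python
# Illustrative restatement of the reported hyper-parameters; key names are
# assumptions, not the repository's actual configuration schema.
pex_config = {
    "num_parallel_envs": 1,
    "discount": 0.99,
    "replay_buffer_size": int(1e6),
    "batch_size": 256,
    "mlp_hidden_sizes": [256, 256],
    "learning_rate": 3e-4,
    "initial_collection_steps": 5000,
    "target_update_speed": 5e-3,
    "expectile_tau": 0.9,             # alternate value reported: 0.7
    "inverse_temperature_alpha": 10,  # alternate value reported: 3
    "num_offline_iterations": int(1e6),
    "num_online_iterations": int(1e6),
    "iterations_per_rollout_step": 1,
    # target entropy (SAC): -dim(A), i.e. minus the action dimensionality
}
```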