Policy Expansion for Bridging Offline-to-Online Reinforcement Learning
Authors: Haichao Zhang, Wei Xu, Haonan Yu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach. Code is available: https://github.com/Haichao-Zhang/PEX. |
| Researcher Affiliation | Industry | Haichao Zhang, Wei Xu, Haonan Yu; Horizon Robotics, Cupertino, CA 95014; {haichao.zhang, wei.xu, haonan.yu}@horizon.ai |
| Pseudocode | Yes | Algorithm 1 PEX: Policy Expansion for Offline-to-Online RL (a minimal sketch of the expansion mechanism is given after the table). |
| Open Source Code | Yes | Code is available: https://github.com/Haichao-Zhang/PEX. |
| Open Datasets | Yes | We use the standard D4RL benchmark which has been widely used in offline RL community (Fu et al., 2020). For offline learning, we use the provided dataset for training. For online learning, we use the accompanied simulator for interaction and training. |
| Dataset Splits | Yes | We use the standard D4RL benchmark which has been widely used in offline RL community (Fu et al., 2020). D4RL is a well-established benchmark with predefined dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only refers to the experimental setup and results. |
| Software Dependencies | No | The paper mentions using IQL (Kostrikov et al., 2022) and SAC (Haarnoja et al., 2018) as backbone algorithms, and refers to 'the code released by the authors of (Lee et al., 2021)' for one baseline, but it does not specify explicit version numbers for any software, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Hyper-parameters and values: number of parallel envs = 1; discount = 0.99; replay buffer size = 1e6; batch size = 256; MLP hidden layer sizes = [256, 256]; learning rate = 3e-4; initial collection steps = 5000; target update speed = 5e-3; expectile value τ = 0.9 (0.7); inverse temperature α⁻¹ = 10 (3); number of offline iterations = 1M; number of online iterations = 1M; number of iterations per rollout step = 1; target entropy (SAC) d. These settings are collected in the configuration sketch below. |
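The Pseudocode row refers to Algorithm 1 (PEX). Below is a minimal sketch of the policy-expansion action selection that algorithm describes, assuming PyTorch-style policy and critic callables; `offline_policy`, `online_policy`, `critic`, and `temperature` are illustrative placeholders rather than the released implementation at https://github.com/Haichao-Zhang/PEX.

```python
import torch
import torch.nn.functional as F


def pex_select_action(state, offline_policy, online_policy, critic, temperature=1.0):
    """Sketch of PEX-style action selection (not the authors' released code).

    The frozen policy learned offline and the newly initialized online policy
    each propose an action; one proposal is picked by sampling from a
    Boltzmann distribution over their critic values.
    """
    with torch.no_grad():
        candidates = torch.stack([
            offline_policy(state),   # policy learned offline, kept frozen
            online_policy(state),    # new policy trained during online phase
        ])                           # shape: [num_policies, action_dim]
        q_values = torch.stack([critic(state, a) for a in candidates])  # [num_policies]
        probs = F.softmax(q_values / temperature, dim=0)
        idx = torch.multinomial(probs, num_samples=1).item()
    return candidates[idx]
```

Under this reading, the offline policy stays frozen and only the newly added policy is updated online, with the actual update rules coming from the backbone algorithm (IQL) used in the paper.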
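The hyper-parameters quoted in the Experiment Setup row can be gathered into a single configuration. This is only a transcription sketch: the dictionary keys are invented here for readability, and the parenthesised alternatives are kept as comments because the extract does not state which task domain each value applies to.

```python
# Sketch collecting the hyper-parameters quoted in the table above.
pex_hparams = {
    "num_parallel_envs": 1,
    "discount": 0.99,
    "replay_buffer_size": int(1e6),
    "batch_size": 256,
    "mlp_hidden_sizes": [256, 256],
    "learning_rate": 3e-4,
    "initial_collection_steps": 5000,
    "target_update_speed": 5e-3,        # soft target-network update rate
    "expectile_tau": 0.9,               # alternative setting: 0.7
    "inverse_temperature": 10,          # alternative setting: 3
    "num_offline_iterations": 1_000_000,
    "num_online_iterations": 1_000_000,
    "iterations_per_rollout_step": 1,
    # target entropy (SAC): value truncated in the extracted text
}
```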