Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning

Authors: Chengxing Jia, Chenxiao Gao, Hao Yin, Fuxiang Zhang, Xiong-Hui Chen, Tian Xu, Lei Yuan, Zongzhang Zhang, Zhi-Hua Zhou, Yang Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that ReDM is capable of learning a valid policy solely through rehearsal, even with zero interaction data. We further extend ReDM to scenarios where limited or mismatched interaction data is available, and our experimental results reveal that ReDM produces high-performing policies compared to other offline RL baselines.
Researcher Affiliation | Collaboration | Chengxing Jia (1,2), Chen-Xiao Gao (1), Hao Yin (1), Fuxiang Zhang (1,2), Xiong-Hui Chen (1,2), Tian Xu (1,2), Lei Yuan (1,2), Zongzhang Zhang (1), Zhi-Hua Zhou (1), Yang Yu (1,2); (1) National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; (2) Polixir Technologies
Pseudocode | Yes | Algorithm 1: Framework of Policy Rehearsing; Algorithm 2: Candidate Model Generation; Algorithm 3: Policy Optimization; Algorithm 4: ReDM; Algorithm 5: Candidate Model Generation with Offline Data. (A structural sketch of the outer loop appears below the table.)
Open Source Code | No | The paper mentions using implementations from CORL and a general offline RL codebase for baselines (e.g., "For MOPO and MAPLE, we used the implementation from the Offline RL codebase."), but it does not explicitly state that the source code for ReDM or ReDM-o is made publicly available.
Open Datasets | Yes | We test ReDM with limited or misspecified data from D4RL, a widely used benchmark for offline RL... we conduct experiments on three representative Gym (Brockman et al., 2016) environments with continuous action spaces, namely Inverted Pendulum, Mountain Car (Continuous), and Acrobot.
Dataset Splits | No | The paper describes sampling subsets from D4RL datasets (e.g., "sample a subset of data, which is only 200 or 5000 transitions, from random datasets in D4RL") and varying the parameters of the Gym environments, but it does not provide explicit training/validation/test splits with percentages, counts, or references to predefined splits for its own experimental setup. (A subsampling sketch appears below the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions software frameworks and algorithms such as CORL, SAC, and PPO, but it does not specify version numbers for these components, which would be necessary for a reproducible description of dependencies.
Experiment Setup | Yes | The hyperparameters used are listed in Table 2 (for ReDM) and Table 3 (for ReDM-o). The hyperparameters of concern include: λ, which balances the diversity reward and the eligibility reward in Eq. 3; N, the number of trajectories the planner collects to compute the eligibility reward; Penalty, the coefficient of the reward penalty...; H, the policy rollout horizon; and Emodel, Epolicy, and Etotal, the training epochs for model optimization, policy optimization, and the outer loop, respectively. (A configuration sketch appears below the table.)
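
The Pseudocode row names an outer loop that alternates candidate model generation with policy optimization. Below is a minimal structural sketch of that loop, assuming a Python implementation; every function body and identifier (train_candidate_model, optimize_policy, rehearse) is a hypothetical stand-in rather than the authors' code, and only the loop structure and the hyperparameter names (λ, N, Penalty, H, Emodel, Epolicy, Etotal) follow the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_candidate_model(model_set, lam, n_traj, e_model):
    # Hypothetical stand-in: a "model" here is just a random linear dynamics
    # matrix. The real step trains a candidate dynamics model for e_model
    # epochs, rewarding diversity with respect to the models already in
    # model_set and eligibility estimated from n_traj planner trajectories,
    # mixed with coefficient lam (the paper's Eq. 3).
    return rng.normal(size=(4, 4))

def optimize_policy(policy, model_set, horizon, penalty, e_policy):
    # Hypothetical stand-in: the real step trains the policy for e_policy
    # epochs on horizon-step rollouts inside every candidate model, applying
    # a reward penalty with coefficient `penalty`.
    return policy + 0.01 * rng.normal(size=policy.shape)

def rehearse(lam=1.0, n_traj=10, penalty=1.0, horizon=5,
             e_model=1, e_policy=1, e_total=3):
    """Outer loop: grow the candidate-model set, then re-train the policy."""
    model_set, policy = [], np.zeros(4)   # empty rehearsal set, dummy policy
    for _ in range(e_total):              # outer-loop epochs (Etotal)
        model_set.append(train_candidate_model(model_set, lam, n_traj, e_model))
        policy = optimize_policy(policy, model_set, horizon, penalty, e_policy)
    return policy, model_set

if __name__ == "__main__":
    policy, models = rehearse()
    print(f"rehearsed over {len(models)} candidate models")
```

The stand-ins preserve only the control flow; the actual candidate-model step optimizes the diversity/eligibility mixture of Eq. 3 rather than returning a random matrix.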
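
For the Dataset Splits row, the following is a hedged sketch of how a 200- or 5000-transition subset might be drawn from a D4RL random dataset. The environment name and the uniform random sampling strategy are assumptions for illustration, not details taken from the paper.

```python
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym
import numpy as np

def sample_subset(env_name="halfcheetah-random-v2", n_transitions=5000, seed=0):
    env = gym.make(env_name)
    data = d4rl.qlearning_dataset(env)   # dict of parallel arrays:
                                         # observations, actions,
                                         # next_observations, rewards, terminals
    idx = np.random.default_rng(seed).choice(
        len(data["observations"]), size=n_transitions, replace=False)
    return {k: v[idx] for k, v in data.items()}

subset = sample_subset(n_transitions=200)
print({k: v.shape for k, v in subset.items()})
```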
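
Finally, a hypothetical configuration container for the hyperparameters listed in the Experiment Setup row. The field names follow the paper's notation, but the default values are placeholders only; the actual settings are given in the paper's Tables 2 and 3.

```python
from dataclasses import dataclass

@dataclass
class ReDMHyperparams:
    lam: float = 1.0      # λ: balances diversity vs. eligibility reward (Eq. 3)
    n_traj: int = 10      # N: planner trajectories used for the eligibility reward
    penalty: float = 1.0  # coefficient of the reward penalty
    horizon: int = 5      # H: policy rollout horizon
    e_model: int = 100    # Emodel: training epochs for model optimization
    e_policy: int = 100   # Epolicy: training epochs for policy optimization
    e_total: int = 10     # Etotal: outer-loop epochs

cfg = ReDMHyperparams()
print(cfg)
```

A container like this could parameterize the rehearse() sketch above.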