Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Authors: Xuefeng Liu, Takuma Yoneda, Rick Stevens, Matthew Walter, Yuxin Chen

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methods, showing superior performance across various domains.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Chicago; (2) Toyota Technological Institute at Chicago
Pseudocode | Yes | Algorithm 1: Robust Policy Improvement (RPI)
Open Source Code | Yes | Please check out our website: https://robust-policy-improvement.github.io/
Open Datasets | Yes | We evaluate our method on eight continuous state and action space domains: Cheetah-run, Cartpole-swingup, Pendulum-swingup, and Walker-walk from the DeepMind Control Suite (Tassa et al., 2018); and Window-close, Faucet-open, Drawer-close, and Button-press from Meta-World (Yu et al., 2020). (See the environment-loading sketch below the table.)
Dataset Splits | No | No explicit training/validation/test dataset splits are provided; the paper evaluates reinforcement learning environments, where performance is measured over training steps rather than on static dataset splits.
Hardware Specification | Yes | We conducted our experiments on a cluster that includes CPU nodes (approximately 280 cores) and GPU nodes (approximately 110 NVIDIA GPUs, ranging from Titan X to A6000, set up mostly in 4- and 8-GPU configurations).
Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, or TensorFlow versions, or library versions) are listed.
Experiment Setup | Yes | Table 3 (RPI hyperparameters): learning rate 3 × 10^-4; optimizer Adam; nonlinearity ReLU; number of functions in a value-function ensemble 5; number of oracles in the oracle set (K) 3; buffer size for oracle k, |D_k|, 19200; number of episodes to roll out the oracles for value-function pretraining 8; horizon (H) 300 for Meta-World and 1000 for DMControl; replay buffer size for the learner policy, |D_n|, 2048; GAE gamma (γ) 0.995; GAE lambda (λ) 0 for AggreVaTeD and Max-Aggregation, 0 or 1 for the LOKI variant, and 0.9 for RPI; number of training steps (rounds) (N) 100; number of episodes to perform RIRO (Alg. 1, line 4) per training iteration 4; mini-batch size 128; number of epochs to perform gradient updates per training iteration 4. (See the configuration sketch below the table.)
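
The Open Datasets row lists eight tasks drawn from two public benchmark suites. As a hedged illustration, not taken from the paper or its released code, the snippet below shows one way to instantiate a task from each suite in Python; the exact task-name strings and the metaworld MT1 usage are assumptions about current library versions.

```python
# Hedged sketch: instantiating one evaluation environment from each benchmark
# suite named in the paper. Assumes dm_control and metaworld are installed;
# the "-v2" task name and MT1 API details are assumptions, not from the paper.
import random

from dm_control import suite   # DeepMind Control Suite (Tassa et al., 2018)
import metaworld               # Meta-World (Yu et al., 2020)

# DMC domains used in the paper: Cheetah-run, Cartpole-swingup,
# Pendulum-swingup, Walker-walk.
dmc_env = suite.load(domain_name="cheetah", task_name="run")
timestep = dmc_env.reset()

# Meta-World tasks used in the paper: Window-close, Faucet-open,
# Drawer-close, Button-press.
mt1 = metaworld.MT1("window-close-v2")            # assumed v2 task name
mw_env = mt1.train_classes["window-close-v2"]()
mw_env.set_task(random.choice(mt1.train_tasks))   # Meta-World requires an explicit task
obs = mw_env.reset()
```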
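
The Experiment Setup row packs the Table 3 values into a single sentence; collecting them in a configuration dictionary makes them easier to scan. This is a minimal sketch that only restates the reported values; the key names and the dictionary structure are my own and are not taken from the paper's code.

```python
# Hedged sketch: Table 3 hyperparameters as a plain Python config.
# Values are copied from the Experiment Setup row; key names are assumptions.
RPI_HYPERPARAMS = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "nonlinearity": "ReLU",
    "value_ensemble_size": 5,             # functions per value-function ensemble
    "num_oracles_K": 3,                   # oracles in the oracle set
    "oracle_buffer_size": 19_200,         # |D_k| for each oracle k
    "value_pretrain_episodes": 8,         # oracle rollouts for value-function pretraining
    "horizon_H": {"metaworld": 300, "dmcontrol": 1000},
    "learner_replay_buffer_size": 2_048,  # |D_n|
    "gae_gamma": 0.995,
    "gae_lambda": {                       # per-method GAE lambda
        "aggrevated": 0.0,
        "max_aggregation": 0.0,
        "loki_variant": (0.0, 1.0),       # reported as "0 or 1"
        "rpi": 0.9,
    },
    "training_rounds_N": 100,
    "riro_episodes_per_iter": 4,          # Alg. 1, line 4
    "minibatch_size": 128,
    "epochs_per_iter": 4,                 # gradient-update epochs per training iteration
}
```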