Blending Imitation and Reinforcement Learning for Robust Policy Improvement
Authors: Xuefeng Liu, Takuma Yoneda, Rick Stevens, Matthew Walter, Yuxin Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methods, showing superior performance across various domains. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Chicago; Toyota Technological Institute at Chicago |
| Pseudocode | Yes | Algorithm 1 Robust Policy Improvement (RPI) |
| Open Source Code | Yes | Please check out our website: https://robust-policy-improvement.github.io/ |
| Open Datasets | Yes | We evaluate our method on eight continuous state and action space domains: Cheetah-run, Cartpole-swingup, Pendulum-swingup, and Walker-walk from the DeepMind Control Suite (Tassa et al., 2018); and Window-close, Faucet-open, Drawer-close, and Button-press from Meta-World (Yu et al., 2020). |
| Dataset Splits | No | No explicit training/test/validation dataset splits are provided, as the paper deals with reinforcement learning environments where evaluation is typically based on performance metrics over training steps rather than static dataset splits. |
| Hardware Specification | Yes | We conducted our experiments on a cluster that includes CPU nodes (approximately 280 cores) and GPU nodes (approximately 110 Nvidia GPUs, ranging from Titan X to A6000, set up mostly in 4- and 8-GPU configurations). |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, TensorFlow versions, library versions) are listed. |
| Experiment Setup | Yes | Table 3: RPI Hyperparameters. Learning rate 3×10^-4; optimizer Adam; nonlinearity ReLU; # of functions in a value-function ensemble 5; # of oracles in the oracle set (K) 3; buffer size for oracle k, \|Dk\|, 19200; # of episodes to roll out the oracles for value-function pretraining 8; horizon (H) 300 for Meta-World and 1000 for DMControl; replay-buffer size for the learner policy \|Dn\| 2048; GAE gamma (γ) 0.995; GAE lambda (λ) 0 for AggreVaTeD and Max-Aggregation, 0 or 1 for the LOKI variant, 0.9 for RPI; # of training steps (rounds) (N) 100; # of episodes to perform RIRO (Alg. 1, line 4) per training iteration 4; mini-batch size 128; # of epochs to perform gradient updates per training iteration 4. |
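
For quick reference, the Table 3 values quoted in the last row can be collected into a single configuration object. The sketch below is illustrative only; the key names are ours and are not taken from the authors' released code.

```python
# Hedged sketch: RPI hyperparameters as reported in Table 3 of the paper.
# Key names are illustrative, not the authors' identifiers.
RPI_HYPERPARAMS = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "nonlinearity": "ReLU",
    "value_ensemble_size": 5,             # functions in the value-function ensemble
    "num_oracles": 3,                     # K, oracles in the oracle set
    "oracle_buffer_size": 19_200,         # |Dk|, per-oracle buffer
    "pretrain_rollout_episodes": 8,       # oracle rollouts for value-function pretraining
    "horizon": {"Meta-World": 300, "DMControl": 1000},  # H
    "learner_buffer_size": 2048,          # |Dn|, learner replay buffer
    "gae_gamma": 0.995,
    "gae_lambda": {"AggreVaTeD": 0.0, "Max-Aggregation": 0.0,
                   "LOKI-variant": (0.0, 1.0), "RPI": 0.9},
    "num_training_rounds": 100,           # N
    "riro_episodes_per_round": 4,         # RIRO rollouts (Alg. 1, line 4) per iteration
    "minibatch_size": 128,
    "epochs_per_round": 4,                # gradient-update epochs per iteration
}
```
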
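The eight evaluation domains listed under "Open Datasets" come from two standard benchmarks. As a reference, here is a minimal loading sketch, assuming the `dm_control` and `metaworld` packages; the exact Meta-World task identifiers (e.g. `"window-close-v2"`) and version suffixes are assumptions, since the paper does not state package versions.

```python
import random

from dm_control import suite   # DeepMind Control Suite
import metaworld               # Meta-World benchmark

# DeepMind Control Suite tasks named in the paper.
DMC_TASKS = [("cheetah", "run"), ("cartpole", "swingup"),
             ("pendulum", "swingup"), ("walker", "walk")]
dmc_envs = {f"{domain}-{task}": suite.load(domain_name=domain, task_name=task)
            for domain, task in DMC_TASKS}

# Meta-World tasks named in the paper; the "-v2" suffixes are assumed.
METAWORLD_TASKS = ["window-close-v2", "faucet-open-v2",
                   "drawer-close-v2", "button-press-v2"]
mw_envs = {}
for name in METAWORLD_TASKS:
    ml1 = metaworld.ML1(name)                     # single-task benchmark wrapper
    env = ml1.train_classes[name]()               # construct the environment
    env.set_task(random.choice(ml1.train_tasks))  # sample a task variation
    mw_envs[name] = env
```
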