Blending Imitation and Reinforcement Learning for Robust Policy Improvement

Authors: Xuefeng Liu, Takuma Yoneda, Rick Stevens, Matthew Walter, Yuxin Chen

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations and theoretical analysis validate that RPI excels in comparison to existing state-of-the-art methods, showing superior performance across various domains.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Chicago; (2) Toyota Technological Institute at Chicago
Pseudocode | Yes | Algorithm 1: Robust Policy Improvement (RPI)
Open Source Code | Yes | Please check out our website: https://robust-policy-improvement.github.io/
Open Datasets | Yes | We evaluate our method on eight continuous state and action space domains: Cheetah-run, Cartpole-swingup, Pendulum-swingup, and Walker-walk from the DeepMind Control Suite (Tassa et al., 2018); and Window-close, Faucet-open, Drawer-close, and Button-press from Meta-World (Yu et al., 2020). (See the environment-loading sketch below the table.)
Dataset Splits | No | No explicit training/validation/test dataset splits are provided; the paper evaluates reinforcement learning environments, where performance is measured over training steps rather than on static dataset splits.
Hardware Specification | Yes | We conducted our experiments on a cluster that includes CPU nodes (approximately 280 cores) and GPU nodes (approximately 110 NVIDIA GPUs, ranging from Titan X to A6000, set up mostly in 4- and 8-GPU configurations).
Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, or TensorFlow versions, or library versions) are listed.
Experiment Setup | Yes | Table 3 (RPI hyperparameters): learning rate 3 × 10^-4; optimizer Adam; nonlinearity ReLU; number of functions in a value-function ensemble 5; number of oracles in the oracle set (K) 3; buffer size for oracle k, |D_k|, 19200; number of episodes to roll out the oracles for value-function pretraining 8; horizon (H) 300 for Meta-World and 1000 for DMControl; replay buffer size for the learner policy, |D_n|, 2048; GAE gamma (γ) 0.995; GAE lambda (λ) 0 for AggreVaTeD and Max-Aggregation, 0 or 1 for the LOKI variant, and 0.9 for RPI; number of training steps (rounds) (N) 100; number of episodes to perform RIRO (Alg. 1, line 4) per training iteration 4; mini-batch size 128; number of epochs to perform gradient updates per training iteration 4. (See the configuration sketch below the table.)
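
The Open Datasets row lists eight tasks drawn from two public benchmark suites. As a hedged illustration, not taken from the paper or its released code, the snippet below shows one way to instantiate a task from each suite in Python; the exact task-name strings and the metaworld MT1 usage are assumptions about current library versions.

```python
# Hedged sketch: instantiating one evaluation environment from each benchmark
# suite named in the paper. Assumes dm_control and metaworld are installed;
# the "-v2" task name and MT1 API details are assumptions, not from the paper.
import random

from dm_control import suite   # DeepMind Control Suite (Tassa et al., 2018)
import metaworld               # Meta-World (Yu et al., 2020)

# DMC domains used in the paper: Cheetah-run, Cartpole-swingup,
# Pendulum-swingup, Walker-walk.
dmc_env = suite.load(domain_name="cheetah", task_name="run")
timestep = dmc_env.reset()

# Meta-World tasks used in the paper: Window-close, Faucet-open,
# Drawer-close, Button-press.
mt1 = metaworld.MT1("window-close-v2")            # assumed v2 task name
mw_env = mt1.train_classes["window-close-v2"]()
mw_env.set_task(random.choice(mt1.train_tasks))   # Meta-World requires an explicit task
obs = mw_env.reset()
```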
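
The Experiment Setup row packs the Table 3 values into a single sentence; collecting them in a configuration dictionary makes them easier to scan. This is a minimal sketch that only restates the reported values; the key names and the dictionary structure are my own and are not taken from the paper's code.

```python
# Hedged sketch: Table 3 hyperparameters as a plain Python config.
# Values are copied from the Experiment Setup row; key names are assumptions.
RPI_HYPERPARAMS = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "nonlinearity": "ReLU",
    "value_ensemble_size": 5,             # functions per value-function ensemble
    "num_oracles_K": 3,                   # oracles in the oracle set
    "oracle_buffer_size": 19_200,         # |D_k| for each oracle k
    "value_pretrain_episodes": 8,         # oracle rollouts for value-function pretraining
    "horizon_H": {"metaworld": 300, "dmcontrol": 1000},
    "learner_replay_buffer_size": 2_048,  # |D_n|
    "gae_gamma": 0.995,
    "gae_lambda": {                       # per-method GAE lambda
        "aggrevated": 0.0,
        "max_aggregation": 0.0,
        "loki_variant": (0.0, 1.0),       # reported as "0 or 1"
        "rpi": 0.9,
    },
    "training_rounds_N": 100,
    "riro_episodes_per_iter": 4,          # Alg. 1, line 4
    "minibatch_size": 128,
    "epochs_per_iter": 4,                 # gradient-update epochs per training iteration
}
```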