Active Policy Improvement from Multiple Black-box Oracles
Authors: Xuefeng Liu, Takuma Yoneda, Chaoqi Wang, Matthew Walter, Yuxin Chen
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that MAPS-SE significantly accelerates policy optimization via state-wise imitation learning from multiple oracles across a broad spectrum of control tasks in the DeepMind Control Suite. Lastly, we conduct extensive experiments on the DeepMind Control Suite benchmark that compare MAPS with MAMBA (Cheng et al., 2020), PPO (Schulman et al., 2017) with GAE (Schulman et al., 2015), and the best oracle from the oracle set. We empirically show that MAPS outperforms the current state-of-the-art (MAMBA). We present an analysis of these performance gains as well as the sample efficiency of our algorithm, which aligns with our theory. We then evaluate our proposed MAPS-SE algorithm and demonstrate that it provides performance gains over MAPS through various experiments. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Chicago, Chicago, IL, USA 2Toyota Technological Institute at Chicago, Chicago, IL, USA. |
| Pseudocode | Yes | Algorithm 1: Max-aggregation Active Policy Selection with Active State Exploration (MAPS-SE); Algorithm 2: Max-aggregation Active Policy Selection with Active State Exploration (MAPS-SE). |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | We evaluate our method on four continuous state and action environments: Cheetah-run, Cartpole-swingup, Pendulum-swingup, and Walker-walk, which are part of the DeepMind Control Suite (Tassa et al., 2018). |
| Dataset Splits | No | The paper mentions using specific environments for evaluation but does not provide details on how the dataset was split into training, validation, or test sets, nor does it refer to predefined splits from cited sources. |
| Hardware Specification | Yes | We performed our experiments on a cluster that includes CPU nodes (about 280 cores) and GPU nodes with about 110 Nvidia GPUs, ranging from Titan X to A6000, set up mostly in 4- and 8-GPU nodes. |
| Software Dependencies | No | The paper mentions training oracle policies using proximal policy optimization (PPO) and soft actor-critic (SAC), but it does not specify version numbers for these libraries or for other key dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | In line 7, we have a buffer with a fixed size (Dn = 19,200) for each oracle, and we discard the oldest data when it fills up. In line 9, we roll out the learner policy until a buffer with a fixed size (Dn = 2,048) fills up, and empty it once the data are used to update the learner policy. This stabilizes the training compared to storing a fixed number of trajectories in the buffer, as MAMBA does. In line 11, we adopted a PPO-style policy update. (A sketch of this buffer scheme follows the table.) |
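
The buffer scheme quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' code: the class names `OracleBuffer` and `RolloutBuffer`, the transition representation, and the `ppo_update` placeholder are assumptions introduced for illustration. Only the capacities (19,200 per oracle, 2,048 for the learner roll-out) and the discard-oldest / empty-after-update behavior are taken from the paper excerpt.

```python
from collections import deque

class OracleBuffer:
    """Per-oracle FIFO buffer: once capacity is reached, the oldest
    transitions are discarded (sketch of the Dn = 19,200 buffers)."""

    def __init__(self, capacity=19_200):
        # deque with maxlen drops the oldest entry automatically when full
        self.transitions = deque(maxlen=capacity)

    def add(self, transition):
        self.transitions.append(transition)

    def __len__(self):
        return len(self.transitions)


class RolloutBuffer:
    """Learner roll-out buffer: filled to a fixed size (Dn = 2,048),
    consumed for one PPO-style update, then emptied, rather than storing
    a fixed number of whole trajectories as MAMBA does."""

    def __init__(self, capacity=2_048):
        self.capacity = capacity
        self.transitions = []

    def add(self, transition):
        self.transitions.append(transition)

    def is_full(self):
        return len(self.transitions) >= self.capacity

    def drain(self):
        # Hand the collected batch to the policy update and clear the buffer.
        batch, self.transitions = self.transitions, []
        return batch


# Hypothetical usage; `collect_step`, `learner_policy`, and `ppo_update`
# are placeholders, not functions from the paper or its codebase:
#
# rollout = RolloutBuffer()
# while training:
#     rollout.add(collect_step(env, learner_policy))
#     if rollout.is_full():
#         ppo_update(learner_policy, rollout.drain())
```
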