Improving Policy Optimization with Generalist-Specialist Learning
Authors: Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, Hao Su
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically observe that an agent trained on many variations (a generalist) tends to learn faster at the beginning, yet its performance plateaus at a less optimal level for a long time. In contrast, an agent trained only on a few variations (a specialist) can often achieve high returns under a limited computational budget. To have the best of both worlds, we propose a novel generalist-specialist training framework. Specifically, we first train a generalist on all environment variations; when it fails to improve, we launch a large population of specialists with weights cloned from the generalist, each trained to master a selected small subset of variations. We finally resume the training of the generalist with auxiliary rewards induced by demonstrations of all specialists. In particular, we investigate the timing to start specialist training and compare strategies to learn generalists with assistance from specialists. We show that this framework pushes the envelope of policy learning on several challenging and popular benchmarks including Procgen, Meta-World and ManiSkill. |
| Researcher Affiliation | Academia | University of California, San Diego. Correspondence to: Zhiwei Jia <zjia@eng.ucsd.edu>, Hao Su <haosu@eng.ucsd.edu>. |
| Pseudocode | Yes | Algorithm 1 (GSL: Generalist-Specialist Learning). Require: (1) environment E with context space C; (2) number of specialists N_s; (3) number of environment variations for specialists N^l_env; (4) number of demonstrations N^g_D from the generalist and N^s_D from the specialists; (5) performance-plateau criterion H. Steps: 1: Initialize generalist policy π_g. 2: Train π_g on E until H = 1 (e.g., with PPO or SAC). 3–5: If π_g is optimal enough, exit (done with GSL). 6: Find the N^l_env lowest-performing environment variations of E, collectively denoted E_low. 7: Split E_low into N_s disjoint sets of variations {E_i} by splitting the context space C. 8: (Optional) Obtain π_g^low by fine-tuning π_g on E_low. 9–13: For each i = 1, …, N_s in parallel: initialize specialist π_i^s = π_g (or π_g^low); train π_i^s on E_i; generate N^s_D / N_s demos T_i with π_i^s on E_i. 14: Generate N^g_D demos T_g with π_g on E \ E_low. 15: Continue training π_g on E with auxiliary rewards induced from {T_i} ∪ T_g (via DAPG, GAIL, etc.). A hedged Python sketch of this meta-algorithm is given below the table. |
| Open Source Code | No | Footnote 1: 'As a meta-algorithm, the (pseudo)code is available here.' The paper states the pseudocode is 'available here' in this footnote, but does not provide a direct URL or specific repository link for the source code. |
| Open Datasets | Yes | We evaluate our Generalist-Specialist Learning (GSL) framework on three challenging benchmarks: Procgen (Cobbe et al., 2020b), Meta-World (Yu et al., 2020b) and the SAPIEN Manipulation Skill Benchmark (ManiSkill; Mu et al., 2021). For all environments, we use seeds (levels) from 1000 to 2023 for training and from 100000 to 100999 for testing. |
| Dataset Splits | No | The paper explicitly mentions 'training levels' (e.g., '1024 levels for training') and 'test levels' (e.g., '1000 hold-out test levels'), but does not provide details for a separate validation split. |
| Hardware Specification | No | The paper mentions 'Number of threads for collecting samples' (e.g., 64 for Procgen, 10/50 for Meta-World, 4 for Mani Skill), which implies parallel computation, but it does not specify any particular GPU models, CPU models, memory, or cloud computing instances with detailed specifications. |
| Software Dependencies | No | The paper mentions various algorithms like PPO, SAC, DAPG, GAIL, and PPG, and specific models like IMPALA CNN and Point Net + Transformer. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | Appendix A provides detailed hyperparameters for the illustrative example, Procgen, Meta-World, and Mani Skill experiments across multiple tables (Tables 4, 5, 6, 7, 8, 9, 10, 11, 12). These include optimizer, learning rate, discount factor, PPO clip range, entropy loss coefficients, number of threads, samples per epoch/minibatch, total simulation steps, and various GSL-specific parameters. |
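
Since GSL is quoted above only as a meta-algorithm, the following is a minimal Python sketch of Algorithm 1, not the authors' implementation. The helper callables `train`, `evaluate`, `collect_demos`, `plateaued`, and `clone` are hypothetical names standing in for a full RL stack (e.g., PPO/SAC for training and a DAPG- or GAIL-style objective for the demonstration-augmented phase); none of them are specified by the paper.

```python
# Minimal sketch of GSL (Algorithm 1), assuming user-supplied helpers.
# All helper names (train, evaluate, collect_demos, plateaued, clone) are
# hypothetical; the paper only prescribes the overall procedure.
from typing import Callable, List, Sequence


def gsl(
    policy,                          # generalist policy pi_g (any mutable object)
    variations: Sequence[int],       # environment variations, e.g. level seeds
    train: Callable,                 # train(policy, variations, demos=None) -> policy
    evaluate: Callable,              # evaluate(policy, variation) -> mean return
    collect_demos: Callable,         # collect_demos(policy, variations, n) -> list of demos
    plateaued: Callable,             # plateau criterion H over the return history
    clone: Callable,                 # clone(policy) -> copy of the policy weights
    n_specialists: int = 4,          # N_s
    n_low_variations: int = 16,      # N^l_env: variations handed to specialists
    n_demos_generalist: int = 100,   # N^g_D
    n_demos_specialists: int = 100,  # N^s_D, split evenly across specialists
    good_enough: float = float("inf"),
):
    # Phase 1: train the generalist on all variations until performance plateaus (H = 1).
    history: List[float] = []
    while True:
        policy = train(policy, variations)
        history.append(sum(evaluate(policy, v) for v in variations) / len(variations))
        if plateaued(history):
            break
    if history[-1] >= good_enough:
        return policy  # generalist already good enough; done with GSL

    # Rank variations by the generalist's return, take the N^l_env weakest ones,
    # and split them into N_s disjoint subsets (here: round-robin over the ranking).
    ranked = sorted(variations, key=lambda v: evaluate(policy, v))
    low = ranked[:n_low_variations]
    subsets = [low[i::n_specialists] for i in range(n_specialists)]

    # Phase 2: specialists, each initialized by cloning the generalist's weights.
    demos = []
    for subset in subsets:
        specialist = train(clone(policy), subset)
        demos += collect_demos(specialist, subset, n_demos_specialists // n_specialists)

    # Demos from the generalist itself on the variations it already handles well.
    rest = [v for v in variations if v not in set(low)]
    demos += collect_demos(policy, rest, n_demos_generalist)

    # Phase 3: resume generalist training on all variations with auxiliary rewards
    # induced by the demonstrations (the paper uses DAPG- or GAIL-style objectives).
    return train(policy, variations, demos=demos)
```

The round-robin split over the ranked variations is just one convenient way to obtain N_s disjoint subsets; the paper instead splits E_low by partitioning the context space C.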
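
The performance-plateau criterion H is likewise left abstract in the quoted pseudocode. The snippet below shows one simple way such a criterion could be implemented (a moving-window improvement test) and could be passed as the `plateaued` argument above; this is an assumption for illustration, not the rule used in the paper.

```python
# One possible plateau criterion H: an assumption for illustration, not the
# exact rule from the paper. It flags a plateau when the mean return over the
# most recent window improves by less than a small threshold compared with the
# window before it.
def plateaued(history, window: int = 10, min_improvement: float = 0.01) -> bool:
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    previous = sum(history[-2 * window:-window]) / window
    return recent - previous < min_improvement
```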