Guarded Policy Optimization with Imperfect Online Demonstrations
Authors: Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. |
| Researcher Affiliation | Academia | 1 Nanyang Technological University, Singapore; 2 University of California, Los Angeles; 3 The University of Edinburgh; 4 Northwestern University |
| Pseudocode | Yes | Algorithm 1: The workflow of TS2C during training |
| Open Source Code | Yes | Code is available at https://metadriverse.github.io/TS2C. |
| Open Datasets | Yes | The majority of the experiments are conducted on the lightweight driving simulator MetaDrive (Li et al., 2022a). ... we also conduct experiments in several environments of the MuJoCo simulator (Todorov et al., 2012). |
| Dataset Splits | Yes | We choose 100 scenes for training and 50 held-out scenes for testing. |
| Hardware Specification | No | No details of the hardware (e.g., GPU/CPU models, memory) used to run the experiments were provided. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly stated. |
| Experiment Setup | Yes | The hyper-parameters used in the experiments are shown in the following tables. In the TS2C algorithm, larger values of the intervention thresholds ε1 and ε2 lead to a stricter intervention criterion, so fewer steps are under teacher control. ... Table 1 (TS2C, Ours): Discount Factor γ = 0.99; ...; Learning Rate = 0.0001; ... (a sketch of this intervention criterion follows the table) |
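
The value-based intervention criterion quoted above is the core of TS2C's workflow (Algorithm 1): the teacher takes control whenever the student's proposed action is expected to lose too much value. Below is a minimal sketch of that idea, assuming hypothetical stand-ins (`teacher_policy`, `student_policy`, `teacher_value`, `teacher_q`) and collapsing the paper's two thresholds ε1 and ε2 into a single `EPSILON`; only the discount factor γ = 0.99 is taken from Table 1, and the paper's actual criterion may differ from this single-threshold check.

```python
import numpy as np

GAMMA = 0.99     # discount factor, matching Table 1
EPSILON = 0.1    # illustrative single threshold; TS2C uses two (eps1, eps2)

def teacher_policy(state):
    # Stand-in teacher controller: push the state toward the origin.
    return np.clip(-state, -1.0, 1.0)

def student_policy(state):
    # Stand-in learner whose actions may be poor early in training.
    return np.random.uniform(-1.0, 1.0, size=state.shape)

def teacher_value(state):
    # Stand-in V(s): the teacher prefers states near the origin.
    return -np.sum(state ** 2)

def teacher_q(state, action):
    # Stand-in Q(s, a): one-step lookahead under toy dynamics, then
    # bootstrap with the teacher's value estimate.
    next_state = state + 0.1 * action
    return GAMMA * teacher_value(next_state)

def teacher_should_intervene(state, student_action):
    # Intervene when the student's proposed action is expected to lose too
    # much value relative to the teacher's own action. Raising EPSILON makes
    # this check harder to trigger, so fewer steps are under teacher control,
    # matching the description of eps1/eps2 quoted above.
    value_gap = teacher_q(state, teacher_policy(state)) - teacher_q(state, student_action)
    return value_gap > EPSILON

# One guarded step: the teacher's action is executed only if the guard fires.
state = np.array([0.5, -0.3])
a_student = student_policy(state)
intervene = teacher_should_intervene(state, a_student)
action = teacher_policy(state) if intervene else a_student
print(f"intervene={bool(intervene)}, executed action={action}")
```

Gating on a value gap rather than on action similarity is what lets the student keep control of actions that differ from the teacher's, so long as they are not expected to lose value; this is consistent with the paper's claim that the method can exploit teacher policies at different performance levels.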