Guarded Policy Optimization with Imperfect Online Demonstrations

Authors: Zhenghai Xue, Zhenghao Peng, Quanyi Li, Zhihan Liu, Bolei Zhou

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost.
Researcher Affiliation | Academia | Nanyang Technological University, Singapore; University of California, Los Angeles; The University of Edinburgh; Northwestern University
Pseudocode | Yes | Algorithm 1: The workflow of TS2C during training (see the rollout sketch after the table)
Open Source Code | Yes | Code is available at https://metadriverse.github.io/TS2C.
Open Datasets | Yes | The majority of the experiments are conducted on the lightweight driving simulator MetaDrive (Li et al., 2022a). ... we also conduct experiments in several environments of the MuJoCo simulator (Todorov et al., 2012).
Dataset Splits | No | We choose 100 scenes for training and 50 held-out scenes for testing. (A seed-based split sketch follows the table.)
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments were provided.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly stated.
Experiment Setup | Yes | The hyper-parameters used in the experiments are shown in the following tables. In the TS2C algorithm, larger values of the intervention thresholds ε1 and ε2 yield a stricter intervention criterion, so fewer steps are under teacher control. ... Table 1: TS2C (Ours) hyper-parameters: Discount Factor γ = 0.99, ..., Learning Rate = 0.0001, ... (See the config sketch after the table.)
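
To make the Algorithm 1 row concrete, the following is a minimal Python sketch of one guarded rollout, assuming a Gymnasium-style environment, student/teacher policy objects with an act(obs) method, and a callable q_teacher(obs, action) value estimate. The should_intervene rule and the eps threshold are illustrative assumptions, not the paper's exact criterion.

def should_intervene(q_teacher, obs, a_student, a_teacher, eps):
    # Hypothetical value-based criterion: the teacher takes over when its
    # Q-value for the student's action falls more than `eps` below the
    # Q-value of its own action. Consistent with the report, a larger
    # `eps` means a stricter criterion, so the teacher intervenes less often.
    return q_teacher(obs, a_teacher) - q_teacher(obs, a_student) > eps

def guarded_rollout(env, student, teacher, q_teacher, eps, max_steps=1000):
    obs, _ = env.reset()  # Gymnasium API: reset() returns (obs, info)
    trajectory = []
    for _ in range(max_steps):
        a_student = student.act(obs)
        a_teacher = teacher.act(obs)
        takeover = should_intervene(q_teacher, obs, a_student, a_teacher, eps)
        action = a_teacher if takeover else a_student
        next_obs, reward, terminated, truncated, _ = env.step(action)
        # Log the takeover flag so the learner can distinguish student
        # actions from teacher demonstrations in the mixed trajectory.
        trajectory.append((obs, action, reward, takeover))
        obs = next_obs
        if terminated or truncated:
            break
    return trajectory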
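
The Dataset Splits row can plausibly be reproduced in MetaDrive by giving the training and test environments disjoint seed ranges. The config keys below come from MetaDrive's public interface but are not confirmed by the paper, and the scenario-count key has been renamed across releases (environment_num in older versions, num_scenarios in newer ones).

from metadrive import MetaDriveEnv

# Assumed split: 100 procedurally generated training scenes and 50
# held-out test scenes, separated by disjoint seed ranges. The values
# are illustrative, not taken from the authors' released code.
train_env = MetaDriveEnv(dict(start_seed=0, num_scenarios=100))
test_env = MetaDriveEnv(dict(start_seed=100, num_scenarios=50))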
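
Finally, the Table 1 excerpt in the Experiment Setup row only preserves two values (γ = 0.99 and learning rate 0.0001). A small config object makes the role of the quoted thresholds explicit; the eps1/eps2 values below are placeholders, since the excerpt does not state the numbers the authors used.

from dataclasses import dataclass

@dataclass
class TS2CConfig:
    # Values taken from the Table 1 excerpt quoted above.
    discount_gamma: float = 0.99
    learning_rate: float = 1e-4
    # Placeholder intervention thresholds: per the report, larger values
    # give a stricter criterion and fewer teacher-control steps.
    eps1: float = 0.1
    eps2: float = 0.1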