Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning

Authors: Chia-Cheng Chiang, Li-Cheng Lan, Wei-Fang Sun, Chien Feng, Cho-Jui Hsieh, Chun-Yi Lee

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental To validate the efficacy of TDIL, we perform comprehensive experiments on five widely adopted MuJoCo benchmarks (Todorov et al., 2012), aligning with most prior IL research, as well as the Adroit Door environment (Rajeswaran et al., 2017) in the Gymnasium-Robotics collection (de Lazcano et al., 2023). The experimental evidence reveals that TDIL delivers exceptional performance, matches expert-level results on these benchmarks, and outperforms existing IL approaches.
Researcher Affiliation Collaboration 1ELSA Lab, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan. 2Department of Computer Science, University of California, Los Angeles, CA, USA. 3NVIDIA AI Technology Center, NVIDIA Corporation, Santa Clara, CA, USA.
Pseudocode Yes Algorithm 1 presents a practical training methodology for the proposed method, referred to as TDIL.
Open Source Code Yes The code implementation and expert data used in this work are available on this GitHub repository.
Open Datasets Yes To validate the efficacy of TDIL, we perform comprehensive experiments on five widely adopted MuJoCo benchmarks (Todorov et al., 2012), aligning with most prior IL research, as well as the Adroit Door environment (Rajeswaran et al., 2017) in the Gymnasium-Robotics collection (de Lazcano et al., 2023).
Dataset Splits No The paper does not explicitly provide training/validation/test dataset splits with specific percentages or counts. It mentions training over 3M timesteps and using five random seeds, as well as discussing blind model selection.
Hardware Specification Yes Table A1. The hardware specification used to perform our experiments. CPU: AMD Ryzen Threadripper 3990X 64-Core Processor; GPU: NVIDIA GeForce RTX 3090.
Software Dependencies No The paper mentions using the Soft Actor-Critic (SAC) framework and MuJoCo, but does not provide specific version numbers for software dependencies like Python, PyTorch, or other libraries.
Experiment Setup Yes The actor and critic networks in SAC are implemented as neural networks with three hidden layers and rectified linear unit (ReLU) activation functions. Each of these hidden layers consists of 256 nodes. All algorithms are trained over 3M timesteps, each employing five random seeds. For the aggregate reward Ragg version, we select β = 0.9 based on our grid search, with the results presented in Table A7.
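The reported architecture (three hidden layers of 256 ReLU units each for both the actor and critic) can be summarized with a minimal sketch. This is an illustrative reconstruction, not the authors' code: the helper name `mlp_layer_sizes` and the HalfCheetah dimensions (17-dim observation, 6-dim action, standard Gym values) are assumptions for the example.

```python
def mlp_layer_sizes(obs_dim, act_dim, hidden=(256, 256, 256)):
    """Return the (in_features, out_features) pair for each linear layer
    of an MLP with the paper's reported shape: three hidden layers of
    256 units (ReLU activations would sit between the linear layers)."""
    sizes = (obs_dim, *hidden, act_dim)
    return list(zip(sizes[:-1], sizes[1:]))

# Example: a HalfCheetah actor network (17-dim observation, 6-dim action)
layers = mlp_layer_sizes(17, 6)
# → [(17, 256), (256, 256), (256, 256), (256, 6)]
```

In the SAC framework the critic would use the same hidden shape but map a concatenated state-action input to a scalar Q-value, i.e. `mlp_layer_sizes(17 + 6, 1)`.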