Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demonstrations

Authors: Xingyu Wang, Diego Klabjan

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In numerical experiments, we demonstrate that our Nash Equilibrium and inverse reinforcement learning algorithms address games that are not amenable to existing benchmark algorithms. Moreover, our algorithm successfully recovers reward and policy functions regardless of the quality of the sub-optimal expert demonstration set.
Researcher Affiliation | Academia | Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL. Correspondence to: Diego Klabjan <d-klabjan@northwestern.edu>.
Pseudocode | Yes | Algorithm 1: Inverse Reinforcement Learning in Zero-Sum Discounted Stochastic Games; Algorithm 2: Adversarial Training Algorithm for Solving f(R) in Zero-Sum Games (Sketch)
Open Source Code | No | The paper does not provide any statement or link indicating that source code for the methodology is openly available.
Open Datasets | No | In order to test our IRL algorithm using the chasing game where the immediate reward is unknown, we generate the sub-optimal demonstration set D as follows.
Dataset Splits | No | The paper describes data generation for expert demonstrations and uses batches of samples during training and evaluation, but does not specify explicit training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments.
Software Dependencies | No | The paper mentions software like "deep neural networks", "Adam", and "Proximal Policy Optimization algorithm (PPO)", but does not provide specific version numbers for these software dependencies (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | For both policy models and state value function models required in actor-critic style PPO in Algorithm 2, we use deep neural nets with a 2-layer 256-neuron structure with rectified linear (Nair & Hinton, 2010) activation functions... For R_{θ_R}(s) we use a 2-layer 256-neuron structure with rectified linear activation functions... we set K_R = 1000, I_R = 20, and τ = 3. The learning rate parameter for the reward function is 2.5 × 10^-5, T is set as 50, and Adam (Kingma & Ba, 2014) is used as optimizer... For the PPO style training, we set horizon length T as 10, and refresh frequency K_refresh as 10. Parameter λ for eligibility traces is set as 0.9. Regarding the adversarial training, we set K_cycle as 100 and K_g as 90. The learning rate parameter for best response models is set as 3 × 10^-4, while for the Nash Equilibrium policies f_{θ_f}, g_{θ_g} it is 10^-4. Adam is used as optimizer. (A hedged configuration sketch follows the table.)
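To make the quoted setup easier to parse, here is a minimal PyTorch sketch of the reported architecture and optimizer settings. Only the layer widths (2 layers of 256 neurons), the ReLU activations, the choice of Adam, and the learning rates (2.5 × 10^-5 for the reward model, 3 × 10^-4 for the best-response models, 10^-4 for the Nash Equilibrium policies) come from the paper; every name (`TwoLayerMLP`, `reward_net`, `f_policy`, `g_policy`, `br_policy`, `br_value`) and the placeholder state/action dimensions are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the quoted experiment setup; names and dimensions are assumptions.
import torch
import torch.nn as nn


class TwoLayerMLP(nn.Module):
    """2-layer, 256-neuron MLP with ReLU activations, as reported in the paper."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


state_dim, action_dim = 8, 4  # placeholder dimensions for the chasing game

# Reward model R_{theta_R}(s): scalar output, Adam with learning rate 2.5e-5.
reward_net = TwoLayerMLP(state_dim, 1)
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=2.5e-5)

# Nash Equilibrium policies f_{theta_f}, g_{theta_g}: Adam with learning rate 1e-4.
f_policy = TwoLayerMLP(state_dim, action_dim)
g_policy = TwoLayerMLP(state_dim, action_dim)
ne_opt = torch.optim.Adam(
    list(f_policy.parameters()) + list(g_policy.parameters()), lr=1e-4
)

# Best-response policy and state value models for actor-critic-style PPO:
# Adam with learning rate 3e-4.
br_policy = TwoLayerMLP(state_dim, action_dim)
br_value = TwoLayerMLP(state_dim, 1)
br_opt = torch.optim.Adam(
    list(br_policy.parameters()) + list(br_value.parameters()), lr=3e-4
)
```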