Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demonstrations
Authors: Xingyu Wang, Diego Klabjan
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In numerical experiments, we demonstrate that our Nash Equilibrium and inverse reinforcement learning algorithms address games that are not amenable to existing benchmark algorithms. Moreover, our algorithm successfully recovers reward and policy functions regardless of the quality of the sub-optimal expert demonstration set. |
| Researcher Affiliation | Academia | 1Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL. Correspondence to: Diego Klabjan <d-klabjan@northwestern.edu>. |
| Pseudocode | Yes | Algorithm 1 Inverse Reinforcement Learning in Zero-Sum Discounted Stochastic Games; Algorithm 2 Adversarial Training Algorithm for Solving f (R) in Zero-Sum Games (Sketch) |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the methodology is openly available. |
| Open Datasets | No | In order to test our IRL algorithm using the chasing game where the immediate reward is unknown, we generate the sub-optimal demonstration set D as follows. |
| Dataset Splits | No | The paper describes data generation for expert demonstrations and uses batches of samples during training and evaluation, but does not specify explicit training/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions tools such as deep neural networks, the Adam optimizer, and the Proximal Policy Optimization (PPO) algorithm, but does not provide version numbers for any software dependencies (e.g., a specific PyTorch or TensorFlow release). |
| Experiment Setup | Yes | For both policy models and state value function models required in actor-critic style PPO in Algorithm 2, we use deep neural nets with a 2-layer 256-neuron structure with rectified linear (Nair & Hinton, 2010) activation functions... For R_{θ_R}(s) we use a 2-layer 256-neuron structure with rectified linear activation functions... we set K_R = 1000, I_R = 20, and τ = 3. The learning rate parameter for the reward function is 2.5 × 10⁻⁵, T is set as 50, and Adam (Kingma & Ba, 2014) is used as optimizer... For the PPO style training, we set horizon length T as 10, and refresh frequency K_refresh as 10. Parameter λ for eligibility traces is set as 0.9. Regarding the adversarial training, we set K_cycle as 100 and K_g as 90. The learning rate parameter for best response models is set as 3 × 10⁻⁴, while for the Nash Equilibrium policies f_{θ_f}, g_{θ_g} it is 10⁻⁴. Adam is used as optimizer. (See the configuration sketch below.) |
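
The experiment-setup details quoted in the last row can be collected into a short configuration sketch. This is a minimal, hypothetical rendering and not the authors' code: the framework (PyTorch), the input/output dimensions, and the helper name `mlp_256x2` are illustrative assumptions. Only the layer sizes, ReLU activations, the choice of Adam, and the hyperparameter values come from the paper; comments on the role of each constant follow the quoted text where it is explicit and are otherwise marked as assumptions.

```python
# Sketch of the reported experiment setup (not the authors' implementation).
import torch
import torch.nn as nn


def mlp_256x2(in_dim: int, out_dim: int) -> nn.Sequential:
    """2-layer, 256-neuron network with rectified linear activations,
    as described for the policy, state-value, and reward models.
    (Counting the two 256-unit layers as the "2 layers" is an assumption.)"""
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )


# Hypothetical state/action dimensions, for illustration only.
STATE_DIM, N_ACTIONS = 8, 5

policy_net = mlp_256x2(STATE_DIM, N_ACTIONS)  # policy models f_{theta_f}, g_{theta_g}
value_net = mlp_256x2(STATE_DIM, 1)           # state value function for actor-critic PPO
reward_net = mlp_256x2(STATE_DIM, 1)          # reward model R_{theta_R}(s)

# Hyperparameter values as quoted above; keys are paraphrased names.
config = {
    "K_R": 1000,               # per the paper; role defined in Algorithm 1/2
    "I_R": 20,                 # per the paper; role defined in Algorithm 1/2
    "tau": 3,
    "lr_reward": 2.5e-5,       # Adam learning rate for the reward function
    "T_reward": 50,            # T used in the reward-training loop
    "ppo_horizon": 10,         # horizon length T for PPO-style training
    "K_refresh": 10,           # refresh frequency
    "lambda_trace": 0.9,       # eligibility-trace parameter lambda
    "K_cycle": 100,            # adversarial training cycle parameter
    "K_g": 90,                 # adversarial training parameter
    "lr_best_response": 3e-4,  # Adam learning rate for best-response models
    "lr_nash_policies": 1e-4,  # Adam learning rate for Nash Equilibrium policies
}

reward_opt = torch.optim.Adam(reward_net.parameters(), lr=config["lr_reward"])
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=config["lr_nash_policies"])
```

The sketch only instantiates the reported architectures and optimizer settings; the surrounding training procedure (Algorithms 1 and 2) is described in the paper but not restated here.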