Balancing Individual Preferences and Shared Objectives in Multiagent Reinforcement Learning
Authors: Ishan Durugkar, Elad Liebman, Peter Stone
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we investigate empirically whether the above conclusion holds in more realistic, more complex settings. The experimental methodology is detailed in Algorithm 1. In Section 5.1 we detail two domains set in the multiagent MDP framework from Section 2.1. In our first experiment (Section 5.2), we vary the different preference signals and the mixing schemes and compare the effect on learning the shared task. Somewhat surprisingly, we find that preferences accelerate improvement in the task performance in both these environments. Further, in Section 5.3 we demonstrate a method to find a mixing scheme that outperforms purely task-reward-based learning in both domains. |
| Researcher Affiliation | Collaboration | Ishan Durugkar (1), Elad Liebman (2), and Peter Stone (1,3); (1) University of Texas at Austin, (2) SparkCognition Research, (3) Sony AI |
| Pseudocode | Yes | Algorithm 1 Experimental Methodology |
| Open Source Code | Yes | Appendix at https://tinyurl.com/yb8hzx73. |
| Open Datasets | Yes | We study the above framework on two multiagent cooperative domains: the well known predator prey domain [Barrett et al., 2013], and a new chord generation domain. |
| Dataset Splits | No | The paper uses reinforcement learning environments where agents interact directly, and does not specify traditional training/validation/test dataset splits with percentages or sample counts for a static dataset. Evaluation is performed by running episodes. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions using Proximal Policy Optimization (PPO) and GAIL, but does not provide specific software version numbers for programming languages, libraries, or other dependencies (e.g., Python version, PyTorch/TensorFlow version). |
| Experiment Setup | Yes | The discount factor γ ∈ [0, 1) specifies how much to discount future rewards. ... we use a discount factor γ = 0.99. ... Episodes are 100 steps long ... The task is trained in a continuing manner (with no termination), over 30000 steps, with γ = 0.99. ... pretrain(π_i, D_i) using Behavioral Cloning |
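
The "Research Type" row above quotes the paper's comparison of preference signals and mixing schemes against purely task-reward-based learning. The sketch below illustrates the general idea of mixing a shared task reward with an agent-specific preference reward; the linear scheme, the weight `alpha`, and the function name are assumptions made for illustration, not the paper's exact mixing scheme.

```python
# Minimal sketch of mixing a shared task reward with an agent's individual
# preference reward. The linear (convex-combination) scheme and the weight
# `alpha` are illustrative assumptions; the paper compares several mixing
# schemes and searches for one that outperforms pure task reward.

def mixed_reward(task_reward: float, preference_reward: float, alpha: float) -> float:
    """Return a convex combination of the shared task reward and the
    agent-specific preference reward (alpha = 0 recovers pure task reward)."""
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * task_reward + alpha * preference_reward


if __name__ == "__main__":
    # Example: each agent could use a different mixing weight.
    print([mixed_reward(1.0, 0.3, a) for a in (0.0, 0.25, 0.5)])
    # [1.0, 0.825, 0.65]
```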
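The "Experiment Setup" row reports γ = 0.99, 100-step episodes, continuing training over 30000 steps, and behavioral-cloning pretraining of each agent's policy. The skeleton below restates those reported values in code; `env`, `policies`, `update`, and `bc_update` are hypothetical placeholders, since the paper does not release implementation-level details.

```python
# Skeleton of the reported setup: gamma = 0.99, 100-step episodes, continuing
# training over 30000 steps, and behavioral-cloning pretraining per agent.
# The interfaces (env, policies, update, bc_update) are hypothetical
# placeholders, not the authors' code.

GAMMA = 0.99          # discount factor reported in the paper
EPISODE_LENGTH = 100  # steps per evaluation episode
TOTAL_STEPS = 30000   # continuing training, no termination


def pretrain_policies(policies, demos, bc_update):
    """Pretrain each agent's policy on its demonstration set via behavioral cloning."""
    for policy, demo in zip(policies, demos):
        bc_update(policy, demo)  # placeholder for a supervised BC update


def train(env, policies, update):
    """Continuing-task training loop: no episode resets during training."""
    obs = env.reset()
    for step in range(TOTAL_STEPS):
        actions = [p.act(o) for p, o in zip(policies, obs)]
        obs, task_reward, preference_rewards = env.step(actions)
        # `update` stands in for the learner (e.g., a PPO-style update) applied
        # to each agent using its mixed reward signal and discount GAMMA.
        update(policies, obs, actions, task_reward, preference_rewards, GAMMA)
```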