Balancing Individual Preferences and Shared Objectives in Multiagent Reinforcement Learning

Authors: Ishan Durugkar, Elad Liebman, Peter Stone

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we investigate empirically whether the above conclusion holds in more realistic, more complex settings. The experimental methodology is detailed in Algorithm 1. In Section 5.1 we detail two domains set in the multiagent MDP framework from Section 2.1. In our first experiment (Section 5.2), we vary the different preference signals and the mixing schemes and compare the effect on learning the shared task. Somewhat surprisingly, we find that preferences accelerate improvement in task performance in both these environments. Further, in Section 5.3 we demonstrate a method to find a mixing scheme that outperforms purely task-reward-based learning in both domains.
Researcher Affiliation | Collaboration | Ishan Durugkar (University of Texas at Austin), Elad Liebman (SparkCognition Research), and Peter Stone (University of Texas at Austin; Sony AI)
Pseudocode | Yes | Algorithm 1: Experimental Methodology
Open Source Code | Yes | Appendix at https://tinyurl.com/yb8hzx73.
Open Datasets | Yes | We study the above framework on two multiagent cooperative domains: the well-known predator-prey domain [Barrett et al., 2013], and a new chord generation domain.
Dataset Splits | No | The paper uses reinforcement learning environments where agents interact directly, and does not specify traditional training/validation/test dataset splits with percentages or sample counts for a static dataset. Evaluation is performed by running episodes.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions using Proximal Policy Optimization (PPO) and GAIL, but does not provide specific software version numbers for programming languages, libraries, or other dependencies (e.g., Python version, PyTorch/TensorFlow version).
Experiment Setup | Yes | The discount factor γ ∈ [0, 1) specifies how much to discount future rewards. ... we use a discount factor γ = 0.99. ... Episodes are 100 steps long ... The task is trained in a continuing manner (with no termination), over 30000 steps, with γ = 0.99. ... pretrain(π_i, D_i) using Behavioral Cloning
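
The Research Type row above describes varying preference signals and mixing schemes and comparing against purely task-reward-based learning. The quoted excerpt does not give the exact mixing scheme; the sketch below is a minimal Python illustration assuming a convex combination with a hypothetical coefficient `alpha` (both the name and the functional form are assumptions, not the paper's definition).

```python
def mixed_reward(task_reward: float, preference_reward: float, alpha: float) -> float:
    """Hypothetical mixing scheme: convex combination of the shared task reward
    and one agent's individual preference signal.

    alpha = 0 recovers purely task-reward-based learning (the baseline);
    alpha = 1 trains on the individual preference signal alone.
    """
    return (1.0 - alpha) * task_reward + alpha * preference_reward
```

Sweeping `alpha` between these extremes is one way to compare mixing schemes against the task-only baseline, which matches the kind of comparison the quoted section describes.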
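
The Pseudocode and Experiment Setup rows quote Algorithm 1 (Experimental Methodology) together with a few concrete settings: behavioral-cloning pretraining of each π_i on D_i, 100-step episodes, a continuing training horizon of 30000 steps, and γ = 0.99. The sketch below assembles those pieces into a rough training loop; every interface name here (run_experiment, pretrain_bc, act, preference_signal, store, ppo_update, and the env API) is a placeholder assumption, not the authors' code.

```python
GAMMA = 0.99          # discount factor quoted in the setup
EPISODE_LENGTH = 100  # "Episodes are 100 steps long"
TOTAL_STEPS = 30_000  # continuing training, "over 30000 steps"


def run_experiment(env, agents, demos, alpha):
    """Rough outline of the experimental methodology under the assumptions above."""
    # Pretrain each agent's preference policy pi_i on its demonstration set D_i
    # via Behavioral Cloning, as in the quoted line of Algorithm 1.
    for agent, demo in zip(agents, demos):
        agent.pretrain_bc(demo)

    obs = env.reset()
    for step in range(TOTAL_STEPS):
        actions = [agent.act(ob) for agent, ob in zip(agents, obs)]
        next_obs, task_reward, _ = env.step(actions)  # shared task reward
        for agent, ob, action in zip(agents, obs, actions):
            pref = agent.preference_signal(ob, action)       # individual preference
            reward = mixed_reward(task_reward, pref, alpha)  # mix as sketched above
            agent.store(ob, action, reward)
        obs = next_obs
        if (step + 1) % EPISODE_LENGTH == 0:
            for agent in agents:
                agent.ppo_update(gamma=GAMMA)  # PPO is the learner mentioned in the paper
```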
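
The Software Dependencies row notes that PPO and GAIL are mentioned without version numbers. If a GAIL-style discriminator is what provides an agent's preference signal (an assumption based only on the mention of GAIL), a minimal PyTorch sketch could look like the following; the architecture, reward shaping, and the choice of PyTorch are illustrative, not details stated in the paper.

```python
import torch
import torch.nn as nn


class PreferenceDiscriminator(nn.Module):
    """GAIL-style discriminator sketch: scores (observation, action) pairs by how
    much they resemble an agent's demonstration data (illustrative only)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))  # raw logits

    def preference_reward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Higher where behavior looks like the demonstrations (the usual GAIL shaping).
        with torch.no_grad():
            d = torch.sigmoid(self(obs, act))
        return -torch.log(1.0 - d + 1e-8).squeeze(-1)


def discriminator_loss(disc, demo_obs, demo_act, policy_obs, policy_act):
    """Binary cross-entropy: demonstration pairs labeled 1, on-policy pairs labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    demo_logits = disc(demo_obs, demo_act)
    policy_logits = disc(policy_obs, policy_act)
    return (bce(demo_logits, torch.ones_like(demo_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))
```

The resulting `preference_reward` would then play the role of `preference_signal` in the loop above, feeding into whichever mixing scheme is being evaluated.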