Balancing Individual Preferences and Shared Objectives in Multiagent Reinforcement Learning

Authors: Ishan Durugkar, Elad Liebman, Peter Stone

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we investigate empirically whether the above conclusion holds in more realistic, more complex settings. The experimental methodology is detailed in Algorithm 1. In Section 5.1 we detail two domains set in the multiagent MDP framework from Section 2.1. In our first experiment (Section 5.2), we vary the different preference signals and the mixing schemes and compare the effect on learning the shared task. Somewhat surprisingly, we find that preferences accelerate improvement in task performance in both these environments. Further, in Section 5.3 we demonstrate a method to find a mixing scheme that outperforms purely task-reward-based learning in both domains.
Researcher Affiliation | Collaboration | Ishan Durugkar (University of Texas at Austin), Elad Liebman (SparkCognition Research), and Peter Stone (University of Texas at Austin; Sony AI)
Pseudocode | Yes | Algorithm 1: Experimental Methodology
Open Source Code | Yes | Appendix at https://tinyurl.com/yb8hzx73.
Open Datasets | Yes | We study the above framework on two multiagent cooperative domains: the well-known predator-prey domain [Barrett et al., 2013], and a new chord generation domain.
Dataset Splits | No | The paper uses reinforcement learning environments where agents interact directly, and does not specify traditional training/validation/test dataset splits with percentages or sample counts for a static dataset. Evaluation is performed by running episodes.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions using Proximal Policy Optimization (PPO) and GAIL, but does not provide specific software version numbers for programming languages, libraries, or other dependencies (e.g., Python version, PyTorch/TensorFlow version).
Experiment Setup | Yes | The discount factor γ ∈ [0, 1) specifies how much to discount future rewards. ... we use a discount factor γ = 0.99. ... Episodes are 100 steps long ... The task is trained in a continuing manner (with no termination), over 30000 steps, with γ = 0.99. ... pretrain(π_i, D_i) using Behavioral Cloning
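
The Research Type row above describes varying preference signals and mixing schemes and comparing against purely task-reward-based learning. The quoted excerpt does not give the exact mixing scheme; the sketch below is a minimal Python illustration assuming a convex combination with a hypothetical coefficient `alpha` (both the name and the functional form are assumptions, not the paper's definition).

```python
def mixed_reward(task_reward: float, preference_reward: float, alpha: float) -> float:
    """Hypothetical mixing scheme: convex combination of the shared task reward
    and one agent's individual preference signal.

    alpha = 0 recovers purely task-reward-based learning (the baseline);
    alpha = 1 trains on the individual preference signal alone.
    """
    return (1.0 - alpha) * task_reward + alpha * preference_reward
```

Sweeping `alpha` between these extremes is one way to compare mixing schemes against the task-only baseline, which matches the kind of comparison the quoted section describes.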
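
The Pseudocode and Experiment Setup rows quote Algorithm 1 (Experimental Methodology) together with a few concrete settings: behavioral-cloning pretraining of each π_i on D_i, 100-step episodes, a continuing training horizon of 30000 steps, and γ = 0.99. The sketch below assembles those pieces into a rough training loop; every interface name here (run_experiment, pretrain_bc, act, preference_signal, store, ppo_update, and the env API) is a placeholder assumption, not the authors' code.

```python
GAMMA = 0.99          # discount factor quoted in the setup
EPISODE_LENGTH = 100  # "Episodes are 100 steps long"
TOTAL_STEPS = 30_000  # continuing training, "over 30000 steps"


def run_experiment(env, agents, demos, alpha):
    """Rough outline of the experimental methodology under the assumptions above."""
    # Pretrain each agent's preference policy pi_i on its demonstration set D_i
    # via Behavioral Cloning, as in the quoted line of Algorithm 1.
    for agent, demo in zip(agents, demos):
        agent.pretrain_bc(demo)

    obs = env.reset()
    for step in range(TOTAL_STEPS):
        actions = [agent.act(ob) for agent, ob in zip(agents, obs)]
        next_obs, task_reward, _ = env.step(actions)  # shared task reward
        for agent, ob, action in zip(agents, obs, actions):
            pref = agent.preference_signal(ob, action)       # individual preference
            reward = mixed_reward(task_reward, pref, alpha)  # mix as sketched above
            agent.store(ob, action, reward)
        obs = next_obs
        if (step + 1) % EPISODE_LENGTH == 0:
            for agent in agents:
                agent.ppo_update(gamma=GAMMA)  # PPO is the learner mentioned in the paper
```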
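
The Software Dependencies row notes that PPO and GAIL are mentioned without version numbers. If a GAIL-style discriminator is what provides an agent's preference signal (an assumption based only on the mention of GAIL), a minimal PyTorch sketch could look like the following; the architecture, reward shaping, and the choice of PyTorch are illustrative, not details stated in the paper.

```python
import torch
import torch.nn as nn


class PreferenceDiscriminator(nn.Module):
    """GAIL-style discriminator sketch: scores (observation, action) pairs by how
    much they resemble an agent's demonstration data (illustrative only)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))  # raw logits

    def preference_reward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Higher where behavior looks like the demonstrations (the usual GAIL shaping).
        with torch.no_grad():
            d = torch.sigmoid(self(obs, act))
        return -torch.log(1.0 - d + 1e-8).squeeze(-1)


def discriminator_loss(disc, demo_obs, demo_act, policy_obs, policy_act):
    """Binary cross-entropy: demonstration pairs labeled 1, on-policy pairs labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    demo_logits = disc(demo_obs, demo_act)
    policy_logits = disc(policy_obs, policy_act)
    return (bce(demo_logits, torch.ones_like(demo_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))
```

The resulting `preference_reward` would then play the role of `preference_signal` in the loop above, feeding into whichever mixing scheme is being evaluated.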