Scalable Multi-agent Covering Option Discovery based on Kronecker Graphs
Authors: Jiayu Chen, Jingdi Chen, Tian Lan, Vaneet Aggarwal
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluation on multi-agent tasks built with simulators like Mujoco shows that the proposed algorithm can successfully identify multi-agent options and significantly outperforms the state-of-the-art. |
| Researcher Affiliation | Academia | Jiayu Chen, Purdue University, West Lafayette, IN 47907, chen3686@purdue.edu; Jingdi Chen, The George Washington University, Washington, DC 20052, jingdic@gwu.edu; Tian Lan, The George Washington University, Washington, DC 20052, tlan@gwu.edu; Vaneet Aggarwal, Purdue University, West Lafayette, IN 47907, vaneet@purdue.edu |
| Pseudocode | Yes | The detailed pseudo codes are in Appendix D. |
| Open Source Code | Yes | Codes are available at: https://github.itap.purdue.edu/Clan-labs/Scalable_MAOD_via_KP. |
| Open Datasets | No | The paper mentions using tasks built with simulators (Grid Room, Grid Maze, MuJoCo), but does not provide concrete access information (link, DOI, formal citation) for specific, publicly available datasets used for training. |
| Dataset Splits | No | The paper describes collecting transitions from environment interactions but does not specify explicit train/validation/test dataset splits with percentages, sample counts, or references to predefined splits for model training or evaluation in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running its experiments. It only mentions using simulators. |
| Software Dependencies | No | The paper mentions the MuJoCo simulator and algorithms such as Distributed Q-Learning, Value Iteration, MAPPO, and Soft Actor-Critic, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We collect 1 × 10^4 transitions (i.e., {(s, a, s′)}) for the discrete tasks and 5 × 10^4 transitions for the continuous control tasks in order to build the state transition graphs. ... As for the number, we learn 16 multi-agent options in the Grid tasks and 8 multi-agent options in the Mujoco tasks. ... The maximum number of steps that an agent can take is 200 for the discrete task and 500 for the continuous task. We run each experiment five times with different random seeds and plot the mean and standard deviation during the training process. ... (1) For discrete tasks (e.g., Figure 4(a) and 4(b)), we adopt Distributed Q-Learning [30] (decentralized manner: each agent decides on its own option based on the joint state) or Centralized Q-Learning + Force (centralized manner: viewing n agents as a whole, adopting Q-Learning for this joint agent and forcing them to choose the same joint option at a time) to train the high-level policy, and adopt Value Iteration [31] for the low-level policy training. (2) For continuous control tasks (e.g., Figure 4(c) and 4(d)), the tabular RL algorithms mentioned above cannot work. Instead, to improve the scalability of our method, we adopt MAPPO [32] to train the high-level policy of the joint options and Soft Actor-Critic [33] to train the low-level policy, which are SOTA deep MARL and RL algorithms, respectively. ... We set the training horizon to 5000 episodes, which is five times that of the tabular methods, and the networks are trained for ten iterations in each episode, to maintain fairness. |
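
For readers reconstructing the setup row above, the pipeline it describes reduces to: sample transitions, build an undirected state transition graph, and derive covering options from the Fiedler vector of the graph Laplacian. The sketch below illustrates that step for a single discrete graph; the function names, the NumPy implementation, and the argmin/argmax anchoring rule are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def build_transition_graph(transitions, num_states):
    """Adjacency matrix of an undirected state transition graph built from
    sampled (s, a, s') tuples with integer state indices."""
    A = np.zeros((num_states, num_states))
    for s, _, s_next in transitions:
        if s != s_next:
            A[s, s_next] = A[s_next, s] = 1.0
    return A

def fiedler_vector(A):
    """Eigenvector of the graph Laplacian L = D - A associated with the
    second-smallest eigenvalue (the Fiedler vector)."""
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return eigvecs[:, 1]

def covering_option_endpoints(A):
    """A covering option is anchored at the two states the Fiedler vector
    separates most strongly: its argmin and argmax."""
    f = fiedler_vector(A)
    return int(np.argmin(f)), int(np.argmax(f))

# Toy usage: a 6-state chain explored by sampled transitions.
transitions = [(i, 0, i + 1) for i in range(5)]
A = build_transition_graph(transitions, num_states=6)
# Prints the chain's two far ends (order depends on the eigenvector's sign).
print(covering_option_endpoints(A))
```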
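The scalability claim in the title rests on modeling the joint multi-agent transition graph as a Kronecker product of the individual agents' graphs, so spectral quantities of the exponentially large joint graph can be recovered from the small factors. The toy check below, using made-up 3-state and 2-state factor graphs, verifies only the underlying linear-algebra fact; it is not the paper's full estimation procedure.

```python
import numpy as np

# Toy single-agent adjacency matrices (a 3-state chain and a 2-state pair).
A1 = np.array([[0., 1., 0.],
               [1., 0., 1.],
               [0., 1., 0.]])
A2 = np.array([[0., 1.],
               [1., 0.]])

# Joint transition graph modeled as the Kronecker product of the factors.
A_joint = np.kron(A1, A2)   # 6 x 6; grows multiplicatively with more agents

# Eigen-decompose only the small factor graphs.
w1, V1 = np.linalg.eigh(A1)
w2, V2 = np.linalg.eigh(A2)

# Key structural fact: (A1 ⊗ A2)(v1 ⊗ v2) = (λ1 λ2)(v1 ⊗ v2), so every pair of
# factor eigenpairs yields an eigenpair of the joint graph for free.
for i in range(len(w1)):
    for j in range(len(w2)):
        v = np.kron(V1[:, i], V2[:, j])
        assert np.allclose(A_joint @ v, w1[i] * w2[j] * v)
print("all", len(w1) * len(w2), "joint eigenpairs recovered from the factors")
```

In the paper's setting the factor matrices come from each agent's own sampled transition graph, which keeps the eigen-decompositions small even as the joint state space grows with the number of agents.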