Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis

Authors: Shayegan Omidshafiei, Andrei Kapishnikov, Yannick Assogba, Lucas Dixon, Been Kim

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments investigate the structure of the learned behavior space, which goes beyond prior works on latent clustering by identifying relationships between individual-agent and joint behaviors. We illustrate that clusters identified by MOHBA are useful for highlighting similarities and differences in behaviors throughout training. We also quantitatively analyze the completeness of discovered behavior clusters by adopting a modified version of the concept-discovery framework of Yeh et al. [13] to identify interesting behavior concepts in our multiagent setting. We then test the scalability of our approach by using it for behavioral analysis of several high-dimensional multiagent MuJoCo environments [14]. Finally, we evaluate the approach on the open-sourced OpenAI hide-and-seek policy checkpoints [10], confirming that the behavioral clusters detected by MOHBA closely match the human-expert annotated labels provided with those checkpoints.
Researcher Affiliation | Industry | Shayegan Omidshafiei (somidshafiei@google.com), Andrei Kapishnikov (kapishnikov@google.com), Yannick Assogba (yassogba@google.com), Lucas Dixon (ldixon@google.com), and Been Kim (beenkim@google.com), all with Google Research.
Pseudocode | Yes | Appendix A.7 provides pseudocode.
Open Source Code | No | The paper states: “We provide details for experiment reproducibility in Appendix A.2. We also include model high-level code in Appendix A.7.” Appendix A.7 contains pseudocode, not runnable open-source code for the methodology, and no external link to a code release is provided.
Open Datasets | Yes | Finally, we evaluate the approach on the open-sourced OpenAI hide-and-seek policy checkpoints [10], confirming that the behavioral clusters detected by MOHBA closely match the human-expert annotated labels provided with those checkpoints.
Dataset Splits | Yes | We create an 80-20 train-validation split, then train a 2-layer (8 hidden units each) MLP g via a softmax cross-entropy loss to predict the classes using only zω as input (rather than the actual trajectory τ); a minimal sketch of this probe appears after the table.
Hardware Specification | No | The paper states “We provide all computational details in Appendix A.2.”, but Appendix A.2 does not specify hardware details such as GPU/CPU models or the types of compute resources used.
Software Dependencies | No | The paper mentions software such as the Acme RL library, the TD3 algorithm, RLDS, PyTorch, and the Adam optimizer, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | Appendix A.2 provides hyperparameters such as the Adam optimizer with a learning rate of 1e-4, a batch size of 256, latent dimensions of 2 and 4 for zω and zα, respectively, and β values of 0.05 and 0.01 for the hill-climbing and coordination games, and 0.005 for the HalfCheetah and Ant domains; a configuration sketch collecting these values appears after the table.
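
As a concrete illustration of the Dataset Splits row, the following PyTorch sketch performs an 80-20 split and trains the small probe classifier g on the latent codes. It is a minimal sketch under stated assumptions: the tensor shapes, number of classes, learning rate, and the reading of “2-layer (8 hidden units each)” as two hidden layers are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Assumed placeholders: z_omega holds one latent code per trajectory
# (dimension 2, as reported in Appendix A.2) and labels holds integer
# class ids; the number of trajectories and classes is illustrative.
z_omega = torch.randn(1000, 2)
labels = torch.randint(0, 5, (1000,))

# 80-20 train-validation split over trajectories.
perm = torch.randperm(len(z_omega))
n_train = int(0.8 * len(z_omega))
train_idx, val_idx = perm[:n_train], perm[n_train:]

# Probe g: MLP with two hidden layers of 8 units, trained with softmax
# cross-entropy to predict the class from z_omega alone.
g = nn.Sequential(
    nn.Linear(z_omega.shape[1], 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, int(labels.max()) + 1),
)
optimizer = torch.optim.Adam(g.parameters(), lr=1e-3)  # illustrative learning rate
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(g(z_omega[train_idx]), labels[train_idx])
    loss.backward()
    optimizer.step()

with torch.no_grad():
    val_acc = (g(z_omega[val_idx]).argmax(dim=-1) == labels[val_idx]).float().mean()
    print(f"validation accuracy: {val_acc:.3f}")
```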
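
The hyperparameters reported under Experiment Setup can be gathered into a single configuration, as in the sketch below. This is a hedged illustration: the placeholder encoder and the β-weighted objective are assumptions about how the reported values fit together, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Values as reported in Appendix A.2 of the paper.
config = {
    "learning_rate": 1e-4,
    "batch_size": 256,
    "latent_dim_z_omega": 2,
    "latent_dim_z_alpha": 4,
    "beta": {
        "hill_climbing": 0.05,
        "coordination_game": 0.01,
        "half_cheetah": 0.005,
        "ant": 0.005,
    },
}

# Placeholder encoder; the actual MOHBA architecture is not reproduced here.
encoder = nn.Linear(16, config["latent_dim_z_omega"])
optimizer = torch.optim.Adam(encoder.parameters(), lr=config["learning_rate"])

def objective(reconstruction_loss: torch.Tensor,
              regularizer: torch.Tensor,
              domain: str) -> torch.Tensor:
    # Assumed beta-weighted loss (reconstruction plus beta * regularizer);
    # the exact role of beta in MOHBA's objective is not specified here.
    return reconstruction_loss + config["beta"][domain] * regularizer
```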