Online Ad Hoc Teamwork under Partial Observability
Authors: Pengjie Gu, Mengchen Zhao, Jianye Hao, Bo An
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that ODITS significantly outperforms various baselines in widely used ad hoc teamwork tasks. In our experimental evaluation, by interacting with a small set of given teammates, the trained agents could robustly collaborate with diverse new teammates. Compared with various type-based baselines, ODITS reveals superior ad hoc teamwork performance. Moreover, our ablations show both the necessity of learning latent variables of teamwork situations and inferring the proxy representations of learned variables. |
| Researcher Affiliation | Collaboration | School of Computer Science and Engineering, Nanyang Technological University, Singapore (1); Noah's Ark Lab, Huawei (2); College of Intelligence and Computing, Tianjin University (3) |
| Pseudocode | Yes | Algorithm 1 ODITS Training Algorithm 2 ODITS Testing |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the described methodology. It mentions using 'the open-source implementation mentioned in (Raileanu et al., 2020)' for visualizing policy representations, which refers to a third-party tool. |
| Open Datasets | No | The paper describes generating its own 'teammate set' by utilizing 5 different MARL algorithms and then manually selecting and partitioning policies into training and testing sets. It does not provide access information (link, DOI, specific citation for download) for a publicly available dataset in the conventional sense. |
| Dataset Splits | No | The paper states, 'Finally, we randomly sampled 8 policies as the training set and the other 7 policies as the testing set.' It does not mention a separate validation set or specific percentages for the splits, nor does it describe cross-validation. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper has a section 'A.2 ARCHITECTURE, HYPERPARAMETERS, AND INFRASTRUCTURE' but it only describes hyperparameter settings and training procedures, not specific hardware components like CPU/GPU models or memory details. |
| Software Dependencies | No | The paper mentions using 'RMSprop', 'DQN algorithm', 'VDN', 'QMIX', and 'PyMARL' but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | It is conducted using RMSprop with a learning rate of 5e-4, α of 0.99, and with no momentum or weight decay. For the lambda value, we search over {1e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2}. We finally adopt λ_MI = 1e-3, λ_MI = 1e-3, and λ_MI = 5e-4 for Modified Coin Game, Predator Prey, and Save the City, respectively, since they induce the best performance compared with other values. For the dimension of the latent variables z_t^i and c_t^i, we search over {1, 2, 3, 5, 10} and finally adopt \|z\| = 10 in Save the City and \|z\| = 1 in the other environments. In addition, we set \|c\| = \|z\|. For exploration, we use ε-greedy with ε annealed from 1.0 to 0.05 over 50,000 time steps and kept constant for the rest of the training. Batches of 128 episodes are sampled from the replay buffer, and all components in the framework are trained together in an end-to-end fashion. (An illustrative configuration sketch follows the table.) |
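
The teammate split described in the Dataset Splits row can be illustrated with the following minimal sketch, assuming Python; the policy identifiers and the fixed seed are hypothetical placeholders, not values taken from the paper. Only the 8/7 random partition of 15 policies comes from the reported setup.

```python
# Illustrative only: randomly partition 15 pre-trained teammate policies
# into 8 training and 7 testing policies, as the paper describes.
import random

teammate_policies = [f"policy_{i}" for i in range(15)]  # hypothetical identifiers

rng = random.Random(0)  # fixed seed for a reproducible split (assumption)
shuffled = rng.sample(teammate_policies, k=len(teammate_policies))
train_policies = shuffled[:8]   # 8 policies used for training
test_policies = shuffled[8:]    # remaining 7 policies held out for testing

assert len(train_policies) == 8 and len(test_policies) == 7
```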
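
The Experiment Setup row can be read as the configuration sketch below, assuming PyTorch. The module `agent_net`, the dictionary names, and the `epsilon` helper are hypothetical; only the numeric values (learning rate, α, λ_MI per environment, latent dimensions, ε schedule, batch size) are the ones reported above.

```python
# A minimal sketch, assuming PyTorch, of the training configuration reported in the paper.
import torch

agent_net = torch.nn.Linear(16, 4)  # stand-in for the actual ODITS networks

# RMSprop with lr 5e-4, alpha 0.99, no momentum or weight decay.
optimizer = torch.optim.RMSprop(
    agent_net.parameters(),
    lr=5e-4,
    alpha=0.99,
    momentum=0.0,
    weight_decay=0.0,
)

# Per-environment values adopted after the hyperparameter search.
LAMBDA_MI = {"modified_coin_game": 1e-3, "predator_prey": 1e-3, "save_the_city": 5e-4}
LATENT_DIM = {"modified_coin_game": 1, "predator_prey": 1, "save_the_city": 10}  # |z| = |c|

BATCH_EPISODES = 128  # episodes sampled per update from the replay buffer

def epsilon(step, start=1.0, end=0.05, anneal_steps=50_000):
    """Linear epsilon-greedy schedule: 1.0 -> 0.05 over 50k steps, then constant."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)
```

As the row notes, all components would then be optimized together end-to-end with this single optimizer and schedule.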