Joint Policy Search for Multi-agent Collaboration with Imperfect Information
Authors: Yuandong Tian, Qucheng Gong, Tina Jiang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On multi-agent collaborative tabular games, JPS is proven to never worsen performance and can improve solutions provided by unilateral approaches (e.g., CFR [44]), outperforming algorithms designed for collaborative policy learning (e.g., BAD [16]). Furthermore, for real-world games with exponentially many states, JPS has an online form that naturally links with gradient updates. We apply it to Contract Bridge, a 4-player imperfect-information game where a team of two collaborates to compete against the other team. |
| Researcher Affiliation | Industry | Yuandong Tian (Facebook AI Research, yuandong@fb.com); Qucheng Gong (Facebook AI Research, qucheng@fb.com); Tina Jiang (Facebook AI Research, tinayujiang@fb.com) |
| Pseudocode | Yes | Algorithm 1: Joint Policy Search (tabular form). A hedged sketch of the idea appears after this table. |
| Open Source Code | Yes | Part of the code is released at https://github.com/facebookresearch/jps. |
| Open Datasets | Yes | We generate a training set of 2.5 million hands, drawn from a uniform distribution over permutations of 52 cards. We pre-compute their DDS results. The evaluation dataset contains 50k such hands. Both datasets will be open sourced for the community and future work. A generation sketch follows the table. |
| Dataset Splits | No | The paper mentions a "training set of 2.5 million hands" and an "evaluation dataset contains 50k such hands" but does not explicitly specify a validation set or clear training/testing/validation split percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as CPU or GPU models, or cloud computing specifications. It mentions "massive computational resources" in general terms. |
| Software Dependencies | No | The paper mentions software like A2C, CFR, and BAD, but does not specify version numbers for any of the software dependencies or libraries used in their implementation. |
| Experiment Setup | Yes | During training we run 2000 games in parallel, with a batch size of 1024, an entropy ratio of 0.01, and no discount factor. ... JPS uses a search depth of D = 3 ... After P1's turn, we roll out 5 times to sample the opponents' actions under σ. After P2's turn, we roll out 5 times following σ to get an estimate of v^σ(h). Therefore, for each initial state h0, we run 5 × 5 rollouts for each combination of policy candidates of P1 and P2. Only a small fraction (e.g., 5%) of the games are stopped at some game state to run the search procedure above. A sketch of this rollout bookkeeping follows the table. |
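
The pseudocode row above references Algorithm 1 (Joint Policy Search, tabular form). Below is a minimal sketch of the general idea only: search over *joint* local changes to the collaborating players' tabular policies and accept a change only if the joint expected value improves, which mirrors the never-worsen guarantee quoted above. The helper names (`candidate_changes`, `expected_value`) and the plain hill-climbing loop are illustrative assumptions, not the paper's actual decomposition or its released code.

```python
import itertools
from typing import Callable, Dict, List

# Hypothetical tabular policy: maps an information set (string) to an action (int).
Policy = Dict[str, int]

def joint_policy_search(
    policies: Dict[int, Policy],                          # one tabular policy per teammate
    candidate_changes: Callable[[int, Policy], List[Policy]],  # proposes local edits (hypothetical)
    expected_value: Callable[[Dict[int, Policy]], float],      # joint value of the team's policy
    num_iters: int = 100,
) -> Dict[int, Policy]:
    """Hill-climb on the *joint* policy of players 0 and 1: accept a pair of
    local changes only if the joint expected value strictly improves.
    Including the no-op change {} among the candidates subsumes unilateral
    updates, so the result is never worse than a one-player improvement."""
    value = expected_value(policies)
    for _ in range(num_iters):
        improved = False
        # Search over joint (P1, P2) changes instead of one player at a time.
        for c1, c2 in itertools.product(candidate_changes(0, policies[0]),
                                        candidate_changes(1, policies[1])):
            trial = {0: {**policies[0], **c1}, 1: {**policies[1], **c2}}
            trial_value = expected_value(trial)
            if trial_value > value:           # keep the joint change
                policies, value = trial, trial_value
                improved = True
                break
        if not improved:                      # local optimum of the joint search
            break
    return policies
```

Searching over pairs of changes is what distinguishes the joint search from unilateral approaches such as CFR, which update one player's policy at a time.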
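For the open-datasets row: the paper states that hands are drawn uniformly from permutations of 52 cards. A minimal sketch of generating such deals follows; the card encoding and seat labels are conventional choices, and the DDS (double-dummy solver) pre-computation mentioned in the paper is omitted.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "SHDC"
DECK = [r + s for s in SUITS for r in RANKS]  # 52 distinct cards

def deal_hand(rng: random.Random):
    """Draw one uniformly random deal: shuffle the deck and split it
    into four 13-card hands for seats N, E, S, W."""
    deck = DECK[:]
    rng.shuffle(deck)  # uniform over the 52! permutations
    return {seat: sorted(deck[i * 13:(i + 1) * 13])
            for i, seat in enumerate("NESW")}

# Example: a small, reproducible batch of deals.
rng = random.Random(0)
deals = [deal_hand(rng) for _ in range(3)]
print(deals[0]["N"])
```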
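For the experiment-setup row: each pair of P1/P2 policy candidates is scored with 5 × 5 rollouts from the initial state h0. The sketch below shows only that bookkeeping; `rollout_value` is a hypothetical stand-in for playing the game out under σ, and the dummy used in the usage example just returns random values.

```python
import itertools
import random
from statistics import mean
from typing import Callable

def evaluate_candidates(
    h0,
    p1_candidates,
    p2_candidates,
    rollout_value: Callable,  # (h0, c1, c2) -> float; plays one game to the end
    n_opp: int = 5,           # rollouts sampling the opponents' actions after P1's turn
    n_val: int = 5,           # rollouts following sigma after P2's turn
):
    """Average n_opp * n_val = 25 rollouts for every (P1, P2) candidate
    pair starting from h0, and return the best-scoring pair."""
    scores = {}
    for c1, c2 in itertools.product(p1_candidates, p2_candidates):
        values = []
        for _ in range(n_opp):       # sample the opponents' replies
            for _ in range(n_val):   # sample continuations under sigma
                values.append(rollout_value(h0, c1, c2))
        scores[(c1, c2)] = mean(values)
    best = max(scores, key=scores.get)
    return best, scores

# Usage with a dummy rollout (a real one would play the game under sigma):
rng = random.Random(0)
best, scores = evaluate_candidates(
    h0=None,
    p1_candidates=["a", "b"],
    p2_candidates=["x", "y"],
    rollout_value=lambda h0, c1, c2: rng.random(),
)
print(best, round(scores[best], 3))
```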