Truthful Self-Play
Authors: Shohei Ohsawa
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments with predator-prey, traffic junction, and StarCraft tasks demonstrate the state-of-the-art performance of our framework. As the second contribution, based on the results of the numerical experiments, we report that TSP achieved state-of-the-art performance on various multi-agent tasks comprising up to 20 agents (Section 5). |
| Researcher Affiliation | Industry | Shohei Ohsawa, Founder & CEO, Daisy AI, 6-13-9 Ginza, Chuo-ku, Tokyo, Japan. o@daisy.inc |
| Pseudocode | Yes | We show the whole procedure in Algorithm 1. Algorithm 1: The truthful self-play (TSP). A generic self-play sketch is shown after this table. |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about releasing the source code for the described methodology. |
| Open Datasets | Yes | Using predator-prey (Barrett et al., 2011), traffic junction (Sukhbaatar et al., 2016; Singh et al., 2019), and StarCraft (Synnaeve et al., 2016) environments, which are typically used in Comm-POSG research, we compared the performance of TSP with current neural nets. |
| Dataset Splits | No | The paper mentions the use of specific environments (predator-prey, traffic junction, StarCraft) but does not provide explicit details on dataset splits (e.g., percentages or sample counts for training, validation, or test sets). |
| Hardware Specification | No | We performed 2,000 epochs of experiment with 500 steps, each using 120 CPUs. (No specific CPU model or other hardware details are provided). |
| Software Dependencies | No | The paper mentions "deep learning software libraries such as TensorFlow and PyTorch" but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Table 5: Hyperparameters used in the experiment. β is grid-searched over {0.1, 1, 10, 100}, and the best parameter is shown; the other parameters are not adjusted. Agents n ∈ {3, 5, 10, 20}; observation x_t^i ∈ X = R^9; internal state h_t^i ∈ H = R^64; message z_t^i ∈ Z = R^64; actions a_t^i ∈ A = {·, ·, ·, ·, a_stop}; true state s_t ∈ S = {0, 1}^{25–400}; episode length T = 20; learning rate α = 0.001; truthful rate β = 10; discount rate γ = 1.0; metrics ψ : H → [ ]. |
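For reference, the Table 5 values quoted above can be collected into a single configuration mapping. This is a convenience sketch with illustrative key names; since the paper releases no code, none of these identifiers come from an official implementation:

```python
# Hyperparameters transcribed from Table 5 (as quoted above).
# Key names are illustrative; the paper does not release code.
TSP_HPARAMS = {
    "n_agents": [3, 5, 10, 20],   # agent counts n evaluated in the experiments
    "obs_dim": 9,                 # observation x_t^i in X = R^9
    "hidden_dim": 64,             # internal state h_t^i in H = R^64
    "msg_dim": 64,                # message z_t^i in Z = R^64
    "n_actions": 5,               # five discrete actions, including a_stop
    "episode_length": 20,         # T
    "learning_rate": 0.001,       # alpha
    "truthful_rate": 10,          # beta, grid-searched over {0.1, 1, 10, 100}
    "discount_rate": 1.0,         # gamma
}
```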
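Because only the caption of Algorithm 1 is quoted, the sketch below is not the paper's procedure. It is a minimal, generic self-play training step in PyTorch under two assumptions: a REINFORCE-style policy gradient, and a truthfulness term (here a simple MSE between the emitted message and the agent's hidden state) weighted by the truthful rate β. `CommAgent` and `tsp_like_update` are hypothetical names; dimensions follow Table 5.

```python
# A generic self-play skeleton with a truthfulness penalty.
# This is NOT the paper's Algorithm 1 (which is not quoted in full above);
# it only illustrates the overall shape: a shared-parameter agent trained
# by policy gradient plus a beta-weighted truthfulness term.
import torch
import torch.nn as nn

class CommAgent(nn.Module):
    """Toy agent: encodes an observation, emits a message and action logits."""
    def __init__(self, obs_dim=9, hidden_dim=64, msg_dim=64, n_actions=5):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # h_t^i
        self.msg_head = nn.Linear(hidden_dim, msg_dim)  # z_t^i
        self.policy = nn.Linear(hidden_dim + msg_dim, n_actions)

    def forward(self, obs, incoming_msg):
        h = torch.tanh(self.encoder(obs))
        z = torch.tanh(self.msg_head(h))
        logits = self.policy(torch.cat([h, incoming_msg], dim=-1))
        return logits, z, h

def tsp_like_update(agent, optimizer, episode, beta=10.0, gamma=1.0):
    """One policy-gradient update with a truthfulness penalty.

    episode: list of (obs, incoming_msg, action, reward) tuples.
    The truthfulness term (MSE between message and hidden state) is a
    stand-in; the paper's actual truthful objective may differ.
    """
    # Discounted returns, computed backwards over the episode.
    returns, G = [], 0.0
    for *_, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    loss = torch.tensor(0.0)
    for (obs, msg_in, action, _), G in zip(episode, returns):
        logits, z, h = agent(obs, msg_in)
        logp = torch.log_softmax(logits, dim=-1)[action]
        loss = loss - G * logp + beta * torch.mean((z - h) ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With the configuration above, an agent and optimizer could be built as `agent = CommAgent(TSP_HPARAMS["obs_dim"], TSP_HPARAMS["hidden_dim"], TSP_HPARAMS["msg_dim"], TSP_HPARAMS["n_actions"])` and `torch.optim.Adam(agent.parameters(), lr=TSP_HPARAMS["learning_rate"])`.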