Truthful Self-Play

Authors: Shohei Ohsawa

ICLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Numerical experiments with predator prey, traffic junction, and StarCraft tasks demonstrate the state-of-the-art performance of our framework. As the second contribution, based on the results of the numerical experiments, we report that TSP achieved state-of-the-art performance for various multi-agent tasks made up of up to 20 agents (Section 5).
Researcher Affiliation Industry Shohei Ohsawa, Founder & CEO, Daisy AI, 6-13-9 Ginza, Chuo-ku, Tokyo, Japan, o@daisy.inc
Pseudocode Yes We show the whole procedure in Algorithm 1. Algorithm 1 The truthful self-play (TSP).
Open Source Code No The paper does not provide a specific link or explicit statement about releasing the source code for the described methodology.
Open Datasets Yes Using predator prey (Barrett et al., 2011), traffic junction (Sukhbaatar et al., 2016; Singh et al., 2019), and StarCraft (Synnaeve et al., 2016) environments, which are typically used in Comm-POSG research, we compared the performances of TSP with the current neural nets.
Dataset Splits No The paper mentions the use of specific environments (predator prey, traffic junction, StarCraft) but does not provide explicit details on dataset splits (e.g., percentages or sample counts for training, validation, or testing sets).
Hardware Specification No We performed 2,000 epochs of experiment with 500 steps, each using 120 CPUs. (No specific CPU model or other hardware details are provided).
Software Dependencies No The paper mentions "deep learning software libraries such as TensorFlow and PyTorch" but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup Yes Table 5: Hyperparameters used in the experiment. β is grid-searched over {0.1, 1, 10, 100}, and the best parameter is shown; the other parameters are not adjusted.
  Agents n: {3, 5, 10, 20}
  Observation x_t^i ∈ X: R^9
  Internal state h_t^i ∈ H: R^64
  Message z_t^i ∈ Z: R^64
  Actions a_t^i ∈ A: { , , , , a_stop}
  True state s_t ∈ S: {0, 1}^25 400
  Episode length T: 20
  Learning rate α: 0.001
  Truthful rate β: 10
  Discount rate γ: 1.0
  Metrics ψ ∈ H: [ ]
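The reported hyperparameters can be gathered into a small configuration sketch for anyone attempting a reproduction. The field names below are illustrative (not from the paper); only the values come from Table 5 and the experiment description:

```python
# Hyperparameter sketch for reproducing the TSP experiments.
# Values are taken from the paper's Table 5 and Section 5;
# the dictionary keys themselves are assumptions for readability.
TSP_CONFIG = {
    "n_agents_grid": [3, 5, 10, 20],   # agent counts evaluated
    "obs_dim": 9,                       # observation x_t^i in R^9
    "hidden_dim": 64,                   # internal state h_t^i in R^64
    "message_dim": 64,                  # message z_t^i in R^64
    "episode_length": 20,               # T
    "learning_rate": 0.001,             # alpha
    "truthful_rate": 10,                # beta, grid-searched over {0.1, 1, 10, 100}
    "discount_rate": 1.0,               # gamma
    "epochs": 2000,                     # reported training epochs
    "steps_per_epoch": 500,             # reported steps per epoch
}

# Total environment steps implied by the reported schedule.
total_steps = TSP_CONFIG["epochs"] * TSP_CONFIG["steps_per_epoch"]
```

Note that the grid-search space for β is the only tuned dimension the paper reports; all other values are fixed across tasks.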