Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf

Authors: Xuanfa Jin, Ziyan Wang, Yali Du, Meng Fang, Haifeng Zhang, Jun Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on several ONUW game settings demonstrate the effectiveness and generalizability of our proposed framework. |
| Researcher Affiliation | Academia | Xuanfa Jin (1,3), Ziyan Wang (2), Yali Du (2), Meng Fang (4), Haifeng Zhang (1,3,5), Jun Wang (6). 1: Institute of Automation, Chinese Academy of Sciences; 2: Cooperative AI Lab, Department of Informatics, King's College London; 3: School of Artificial Intelligence, University of Chinese Academy of Sciences; 4: University of Liverpool; 5: Nanjing Artificial Intelligence Research of IA; 6: AI Centre, Department of Computer Science, UCL |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project page of our paper: one-night-ultimate-werewolf.github.io. |
| Open Datasets | No | Given the scarcity of datasets containing human players engaged in the ONUW game, we opt to leverage game logs generated by LLMs, which is the most effective way to collect trajectories for offline RL training. |
| Dataset Splits | No | The paper collects game logs for training, but does not explicitly provide training/validation/test dataset splits with specific percentages or counts. |
| Hardware Specification | Yes | The training of the discussion policy takes an NVIDIA GeForce RTX 3060 Ti GPU for about 2.5 hours. |
| Software Dependencies | No | text-embedding-ada-002 (https://platform.openai.com/docs/models) is adopted to get the state embeddings, and we utilize CQL [45] to train the discussion policy. More training details are given in Appendix E.2. We repeat the game 30 times and report the final results for each evaluation. |
| Experiment Setup | Yes | The hyperparameters we used for CQL are listed in Table 3 (if not listed, use the default values). Table 3 (training hyperparameters for CQL): learning rate 5e-5; discount factor (γ) 0.99; mini-batch size 32; trade-off factor (ρ) 4.0; 2 critics; target critic update interval 1000; 100 epochs; 5000 steps per epoch; state dim 1536; action dim 6. A hedged training sketch using these values is shown after the table. |
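For readers who want to reproduce the discussion-policy training summarized in the Software Dependencies and Experiment Setup rows, here is a minimal sketch. It assumes d3rlpy as the offline-RL library and the OpenAI Python SDK for the state embeddings; the paper names CQL and text-embedding-ada-002 but not a concrete implementation, and `embed_state` and the `dataset` placeholder are hypothetical.

```python
# Minimal reproduction sketch (not the authors' released code).
# Assumes d3rlpy >= 2.0 (discrete CQL) and openai >= 1.0 (embeddings).
import numpy as np
import d3rlpy
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed_state(discussion_text: str) -> np.ndarray:
    """Embed a discussion state with text-embedding-ada-002.
    Returns a 1536-dim vector, matching the paper's state dim."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002",
        input=discussion_text,
    )
    return np.asarray(resp.data[0].embedding, dtype=np.float32)

# Discrete CQL configured with the Table 3 hyperparameters; the action
# space is the paper's 6 discussion tactics (action dim 6).
cql = d3rlpy.algos.DiscreteCQLConfig(
    learning_rate=5e-5,           # learning rate
    gamma=0.99,                   # discount factor (γ)
    batch_size=32,                # mini-batch size
    alpha=4.0,                    # conservatism trade-off (ρ in Table 3)
    n_critics=2,                  # critic num
    target_update_interval=1000,  # target critic update interval
).create(device="cuda:0")         # e.g. the RTX 3060 Ti noted above

# `dataset` would be a d3rlpy MDPDataset built from the LLM-generated game
# logs: embedded states, discussion-tactic indices, and game outcomes.
# 100 epochs x 5000 steps per epoch, as in Table 3:
# cql.fit(dataset, n_steps=100 * 5000, n_steps_per_epoch=5000)
```

The discrete CQL variant is used here because the logged actions are indices over the six discussion tactics rather than continuous vectors.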