Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Multi-agent KTO: Enhancing Strategic Interactions of Large Language Model in Language Game

Authors: Rong Ye, Yongxin Zhang, Yikai Zhang, Haoyu Kuang, Peng Sun, Zhongyu Wei

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform various experiments, including tournament evaluation, Turing-style detectability test, behavioral analysis, generalization ability test, and ablation studies. The experiments show that MaKTO achieved an average win rate of 61% in 9-player Seer-Witch-Guard games against various models such as GPT-4o, Claude-3.5, and multi-staged RL agent.
Researcher Affiliation | Collaboration | Rong Ye (1,2), Yongxin Zhang (2), Yikai Zhang (1,2), Haoyu Kuang (1,2), Peng Sun (2), Zhongyu Wei (1,3). Affiliations: (1) Fudan University; (2) Bytedance Seed; (3) Shanghai Innovation Institute.
Pseudocode | No | The paper describes its methodology through textual explanations and flowcharts (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data are available at project page https://reneeye.github.io/MaKTO.html.
Open Datasets | Yes | We create a large-scale dataset of expert Werewolf players utterances and actions during gameplay, as well as the abundant COT behind their decisions, allowing for effective behavior cloning and fine-tuning of LLMs. Code and data are available at project page https://reneeye.github.io/MaKTO.html.
Dataset Splits | Yes | We collected 331 annotated Werewolf games from 17 expert players via our platform, with 51 additional games reserved for LLM evaluation in Sec. 3.3. ... The SFT dataset comprises 25k samples, including 380 samples of fundamental game comprehension data with terminology explanations, 372 Q&As on advanced gaming techniques, 12k annotated authentic gaming behavior data, and 12k general SFT corpus. ... For the Multi-agent KTO phase, we collected 20k preference data entries from the Seer-Witch-Guard games, consisting of 12k desirable and 8k unacceptable samples.
Hardware Specification | Yes | The 14B models are trained using 8 A100 GPUs and the 72B models used 32 A100 GPUs.
Software Dependencies | No | The paper mentions base models like Qwen2.5-14b-instruct, Qwen2.5-72b-instruct, Llama-3.1-8B-Chinese-Chat, Llama-3.1-70B-Chinese-Chat, and the use of DeepSpeed ZeRO-3 optimization, but it does not specify version numbers for general software dependencies (e.g., Python, PyTorch, or DeepSpeed itself).
Experiment Setup | Yes | The SFT dataset comprises 25k samples... We employed DeepSpeed ZeRO-3 optimization with a learning rate of 1e-6, a warm-up ratio of 0.05, and trained for 3 epochs. For the Multi-agent KTO phase... We set the KTO hyperparameters with λD = 0.7 and λU = 1.0. The training utilized DeepSpeed ZeRO-3 optimization, with a learning rate of 1e-6, a batch size of 2 per device, 150 warmup steps, and train for 20 epochs.
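
To make the numbers in the Dataset Splits and Experiment Setup rows easier to sanity-check, the sketch below plugs the reported values into the per-sample loss of the original (single-agent) KTO formulation. It is an illustration under stated assumptions, not the authors' implementation: the summary does not report the KTO temperature, so `BETA` is a placeholder, and the names `kto_loss`, `weighted_class_ratio`, and `SFT_PARTS` are hypothetical. The multi-agent variant in the paper may differ in how samples are selected and scored.

```python
import math

# Values quoted in the table above.
LAMBDA_D = 0.7            # weight on desirable samples
LAMBDA_U = 1.0            # weight on unacceptable samples
N_DESIRABLE = 12_000      # desirable preference entries
N_UNACCEPTABLE = 8_000    # unacceptable preference entries

# Reported SFT components: game comprehension, technique Q&As,
# annotated gameplay behavior, general SFT corpus.
SFT_PARTS = [380, 372, 12_000, 12_000]

# ASSUMPTION: beta is not reported in this summary; 0.1 is only a placeholder.
BETA = 0.1


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(log_ratio: float, desirable: bool,
             kl_ref: float = 0.0, beta: float = BETA) -> float:
    """Per-sample loss lambda_y - v(x, y) from the original KTO formulation.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x) for this sample.
    kl_ref:    reference point z0 (a batch-level KL estimate in KTO);
               fixed to 0 here purely for illustration.
    """
    if desirable:
        return LAMBDA_D * (1.0 - sigmoid(beta * (log_ratio - kl_ref)))
    return LAMBDA_U * (1.0 - sigmoid(beta * (kl_ref - log_ratio)))


def weighted_class_ratio() -> float:
    """(lambda_D * n_D) / (lambda_U * n_U) for the reported preference data."""
    return (LAMBDA_D * N_DESIRABLE) / (LAMBDA_U * N_UNACCEPTABLE)


if __name__ == "__main__":
    print("SFT components sum to", sum(SFT_PARTS))
    print("Weighted desirable/unacceptable ratio:", round(weighted_class_ratio(), 3))
    print("Loss at log_ratio=0 (desirable):", kto_loss(0.0, True))
    print("Loss at log_ratio=0 (unacceptable):", kto_loss(0.0, False))
```

The listed SFT components add up to 24,752 samples, consistent with the rounded "25k" figure. With λD = 0.7 and λU = 1.0, the weighted ratio of desirable to unacceptable samples is about 1.05, so the lower weight on desirable samples roughly offsets the 12k versus 8k class imbalance; the original KTO paper suggests keeping this ratio between about 1 and 4/3.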