Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy
Authors: Kaixuan Xu, Jiajun Chai, Sicheng Li, Yuqian Fu, Yuanheng Zhu, Dongbin Zhao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive evaluation of DipLLM across various scenarios to assess its effectiveness. First, we evaluate its performance against a pool of baseline opponents... Table 1 presents the performance results for the agents in this population. DipLLM outperforms all other baselines across every metric... Additionally, we examine the benefits of fine-tuning by comparing the performance of the fine-tuned LLM agent with that of a domain-specific model... Finally, we conduct ablation studies to analyze the contributions of the autoregressive factorization and the fine-tuning process. |
| Researcher Affiliation | Academia | 1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 2State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. |
| Pseudocode | No | The paper describes the methodology using prose, mathematical equations, and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | To fine-tune our LLM-based autoregressive factorization agent, we collect raw data through interactions between the domain-specific model DipNet (Paquette et al., 2019a) and the Diplomacy environment. This initial dataset forms the foundation for subsequent training. |
| Dataset Splits | No | Figure 6 highlights the consistent performance improvements of both models as the dataset size increases. Initially, DipNet, a task-specific model pretrained on domain-relevant data, outperforms DipLLM due to its specialized architecture and prior domain knowledge. In contrast, DipLLM, which lacks domain-specific pretraining, starts with lower performance. However, DipLLM demonstrates remarkable data efficiency during fine-tuning. With only 100 games of fine-tuning data, DipLLM not only matches but surpasses DipNet's performance. As the dataset size grows to 500 games, DipLLM achieves a significant lead, outperforming DipNet by 6.7%. The paper mentions dataset sizes (e.g., 100 games, 500 games) for fine-tuning but does not specify how the data is split into training, validation, or test sets. |
| Hardware Specification | No | The paper mentions that the baseline Cicero model requires "up to 448 GPUs for gameplay rollouts" but does not specify the hardware used for the experiments conducted with DipLLM. |
| Software Dependencies | No | Our model is built on the LLaMA 3 8B architecture as the backbone. The paper names the model architecture used (LLaMA 3 8B) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementation or training. |
| Experiment Setup | Yes | We provide the hyperparameters used for training DipLLM in Table 4. Our model is built on the LLaMA 3 8B architecture as the backbone. During training, we employ the Low-Rank Adaptation (LoRA) method (Hu et al., 2022) to update the parameters of the entire LLM. Table 4 (Hyperparameter: Value): Optimizer: AdamW; LoRA α: 32; Dropout Prob: 0.05; Batch Size: 4; Learning Rate Schedule: Linear; Learning Rate: 2e-4; Epoch: 5; Adaptation: 16; Max Seq. Len.: 2048. |
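The Table 4 hyperparameters reported above can be collected into a single training config for reference. This is a minimal sketch, not the authors' code: the dictionary key names are illustrative, and the reading of "Adaptation 16" as the LoRA rank is an assumption, not stated in the paper.

```python
# Hyperparameters reported in Table 4 of the paper, gathered into a plain
# config dict. Key names are illustrative; they do not come from the paper.
dipllm_finetune_config = {
    "backbone": "LLaMA 3 8B",     # base model architecture (reported)
    "optimizer": "AdamW",         # optimizer (reported)
    "lora_alpha": 32,             # LoRA scaling factor alpha (reported)
    "lora_dropout": 0.05,         # dropout probability (reported)
    "batch_size": 4,              # batch size (reported)
    "lr_schedule": "linear",      # learning-rate schedule (reported)
    "learning_rate": 2e-4,        # peak learning rate (reported)
    "epochs": 5,                  # training epochs (reported)
    "adaptation": 16,             # "Adaptation" in Table 4 (assumed LoRA rank)
    "max_seq_len": 2048,          # maximum sequence length (reported)
}

# Sanity-check that the LoRA scaling ratio alpha/r is well defined.
lora_scaling = dipllm_finetune_config["lora_alpha"] / dipllm_finetune_config["adaptation"]
print(lora_scaling)  # → 2.0
```

Such a dict could be passed to a LoRA fine-tuning framework (e.g., mapped onto a `LoraConfig`-style object), but the paper does not name the training software it used.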