Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy

Authors: Kaixuan Xu, Jiajun Chai, Sicheng Li, Yuqian Fu, Yuanheng Zhu, Dongbin Zhao

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive evaluation of DipLLM across various scenarios to assess its effectiveness. First, we evaluate its performance against a pool of baseline opponents... Table 1 presents the performance results for the agents in this population. DipLLM outperforms all other baselines across every metric... Additionally, we examine the benefits of fine-tuning by comparing the performance of the fine-tuned LLM agent with that of a domain-specific model... Finally, we conduct ablation studies to analyze the contributions of the autoregressive factorization and the fine-tuning process.
Researcher Affiliation | Academia | 1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 2. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
Pseudocode | No | The paper describes the methodology using prose, mathematical equations, and figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | No | To fine-tune our LLM-based autoregressive factorization agent, we collect raw data through interactions between the domain-specific model DipNet (Paquette et al., 2019a) and the Diplomacy environment. This initial dataset forms the foundation for subsequent training.
Dataset Splits | No | Figure 6 highlights the consistent performance improvements of both models as the dataset size increased. Initially, DipNet, a task-specific model pretrained on domain-relevant data, outperforms DipLLM due to its specialized architecture and prior domain knowledge. In contrast, DipLLM, which lacks domain-specific pretraining, starts with lower performance. However, DipLLM demonstrates remarkable data efficiency during fine-tuning. With only 100 games of fine-tuning data, DipLLM not only matches but surpasses DipNet's performance. As the dataset size grows to 500 games, DipLLM achieves a significant lead, outperforming DipNet by 6.7%. The paper mentions dataset sizes (e.g., 100 games, 500 games) for fine-tuning but does not specify how the data is split into training, validation, or test sets.
Hardware Specification | No | The paper mentions that the baseline Cicero model requires "up to 448 GPUs for gameplay rollouts" but does not specify the hardware used for the experiments conducted with DipLLM.
Software Dependencies | No | Our model is built on the LLaMA 3 8B architecture as the backbone. The paper mentions the model architecture used (LLaMA 3 8B) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementation or training.
Experiment Setup | Yes | We provide the hyperparameters used for training DipLLM in Table 4. Our model is built on the LLaMA 3 8B architecture as the backbone. During training, we employ the Low-Rank Adaptation (LoRA) method (Hu et al., 2022) to update the parameters of the entire LLM. Table 4 (Hyperparameter: Value) — Optimizer: AdamW; LoRA α: 32; Dropout Prob: 0.05; Batch Size: 4; Learning Rate Schedule: Linear; Learning Rate: 2e-4; Epochs: 5; Adaptation: 16; Max Seq. Len.: 2048.
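The paper does not name its training framework, so as a non-authoritative illustration, the Table 4 hyperparameters can be mapped onto the HuggingFace `peft`/`transformers` stack. This sketch assumes "Adaptation 16" denotes the LoRA rank `r` (a common reading of that row); the output directory is a placeholder.

```python
# Hypothetical config sketch: the paper's actual training stack is unspecified.
# Values below are taken from Table 4 of the reproducibility excerpt above.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                # "Adaptation" in Table 4 (assumed to be the LoRA rank)
    lora_alpha=32,       # LoRA α
    lora_dropout=0.05,   # Dropout Prob
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="dipllm-lora",       # placeholder path, not from the paper
    per_device_train_batch_size=4,  # Batch Size
    learning_rate=2e-4,             # Learning Rate
    lr_scheduler_type="linear",     # Learning Rate Schedule
    num_train_epochs=5,             # Epochs
    optim="adamw_torch",            # AdamW optimizer
)
# The max sequence length (2048) would be enforced at tokenization time,
# and the LLaMA 3 8B backbone loaded separately before wrapping with LoRA.
```

Note this only reconstructs the stated hyperparameters; the data pipeline, prompt format, and loss used for fine-tuning are not recoverable from the excerpt.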