Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Can Large Language Models Master Complex Card Games?

Authors: Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.
Researcher Affiliation	Collaboration	1Nankai University, 2Tsinghua University, 3Beijing University of Posts and Telecommunications, 4Zhipu AI, 5Bosch Center for Artificial Intelligence
Pseudocode	No	The paper includes prompt templates in Appendix A.3 (Figure 1 to Figure 8) which describe structured instructions for the LLM's input and output format. However, these are not pseudocode or algorithm blocks that outline the methodology or a computational procedure of the system itself, but rather define the interface for interaction.
Open Source Code	Yes	The code is available at https://github.com/THUDM/LLM4CardGame
Open Datasets	Yes	For Riichi Mahjong, we download the match data of human professional players from the Tenhou platform for the year 2020. Additionally, we analyze the performance variations of language models with different parameter sizes and types (Qwen2.5 [29], Llama3.1 [30], and GLM4 [9]). Finally, we evaluate whether the models' general capabilities decline using MMLU-Pro [12], Math-500 [27], and HumanEval [28] benchmarks for knowledge question answering, math, and coding skills. The knowledge data, mathematics data, and coding data are taken from part of Tulu3's post-training data [46], as this model has made all its post-training data open source.
Dataset Splits	No	For Dou Dizhu, Guan Dan, and Riichi Mahjong, we sample 1,000k instances as training data. For Uno, Gin Rummy, Leduc Hold'em, Limit Texas Hold'em, and No-limit Texas Hold'em, we sample 400k instances as training data. ... We fine-tune the language model separately on each game's data and then evaluate its performance on the respective game. ... The number of games for the eight games are 1000, 20, 50, 500, 100, 1000, 1000, and 1000, respectively.
Hardware Specification	Yes	We conduct experiments on a server with 8 H100 GPUs. For the model in Figure 2a of our paper, we fine-tuned using a single server equipped with 2 Intel(R) Xeon(R) Platinum 8476C CPUs and 8 H800 GPUs on a dataset with 1 million samples.
Software Dependencies	No	We fine-tune all models with LLaMA-Factory Framework [43] and use LoRA fine-tuning [44].
Experiment Setup	Yes	The LoRA rank and LoRA alpha are set to 8 and 16, respectively. We fine-tune all models with 1 epoch. We apply a peak of 1e-4 learning rate with a cosine scheduler. The batch size is 128. We conduct experiments on a server with 8 H100 GPUs.
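
The hyperparameters quoted in the Experiment Setup row can be turned into concrete numbers with a short sketch. This is an illustration only, not the authors' training code. Several points are assumptions not stated in the quoted text: that the batch size of 128 is the global batch across the 8 GPUs, that the cosine schedule has no warmup and decays to zero, and that the run uses the 1,000k-sample setting mentioned in the Dataset Splits row. All function names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
LORA_RANK = 8
LORA_ALPHA = 16
PEAK_LR = 1e-4
BATCH_SIZE = 128   # assumed to be the global batch size
NUM_GPUS = 8
NUM_EPOCHS = 1


def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def optimizer_steps(num_samples: int, batch_size: int = BATCH_SIZE,
                    epochs: int = NUM_EPOCHS) -> int:
    """Number of optimizer updates, keeping the final partial batch."""
    return math.ceil(num_samples / batch_size) * epochs


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr (assumption: no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def per_gpu_batch(global_batch: int = BATCH_SIZE, num_gpus: int = NUM_GPUS) -> int:
    """Samples per GPU per update if 128 is the global batch (assumption)."""
    return global_batch // num_gpus


if __name__ == "__main__":
    total = optimizer_steps(1_000_000)  # assumed 1,000k-sample setting
    print("LoRA scaling:", lora_scaling(LORA_ALPHA, LORA_RANK))
    print("optimizer steps:", total)
    print("per-GPU batch:", per_gpu_batch())
    for s in (0, total // 2, total):
        print(f"lr at step {s}: {cosine_lr(s, total):.2e}")
```

Under these assumptions, one epoch over 1,000k samples corresponds to 7,813 updates, and the LoRA update is scaled by alpha / rank = 2. If the real configuration uses warmup or gradient accumulation, the step count and learning-rate curve will differ.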