Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MarioGPT: Open-Ended Text2Level Generation through Large Language Models
Authors: Shyam Sudhakaran, Miguel González-Duque, Matthias Freiberger, Claire Glanois, Elias Najarro, Sebastian Risi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we introduce MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in our case Super Mario Bros levels. MarioGPT can not only generate diverse levels, but can be text-prompted for controllable level generation, addressing one of the key challenges of current PCG techniques. ... Section 4: Experiments and Results |
| Researcher Affiliation | Collaboration | Shyam Sudhakaran¹, Miguel González-Duque¹, Matthias Freiberger¹, Claire Glanois¹, Elias Najarro¹, Sebastian Risi¹˒²; ¹IT University of Copenhagen, ²modl.ai, Copenhagen |
| Pseudocode | No | The paper describes processes and models (e.g., the novelty search setup and mutation operators in Figure 3) but does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | Code available at https://github.com/shyamsn97/mario-gpt. ... To facilitate this, the code to run the experiments in this paper is publicly available at: https://github.com/shyamsn97/mario-gpt. |
| Open Datasets | Yes | Mario levels are represented similarly to previous works [45, 8, 35, 33, 34, 12], using the levels provided in the Video Game Level Corpus (VGLC) [40]. |
| Dataset Splits | No | While Table 1 is titled 'Training Reconstruction Accuracy Validation Set', the paper does not explicitly provide specific percentages, absolute sample counts, or a detailed methodology for the training/validation/test dataset splits needed to reproduce the experiment. |
| Hardware Specification | Yes | Because the model is relatively small, it can be trained using a single Nvidia GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions utilizing 'the open source transformers library' and 'distilgpt2' but does not specify concrete version numbers for these or other software dependencies required to replicate the experiment. |
| Experiment Setup | Yes | We train MarioGPT for 50,000 steps, sampling 4 random slices of levels at each iteration and optimize the model using the Adam optimizer [20]. ... In our case, when generating levels we use a temperature of 2.4-2.7. |
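As a rough illustration of the two sampling steps quoted in the Experiment Setup row (drawing 4 random level slices per training iteration, and generating with a temperature of 2.4-2.7), the following Python sketch draws random fixed-width windows from a tile-based level and applies temperature-scaled softmax sampling. It is not taken from the mario-gpt repository: the row-of-strings level layout, the toy tile characters, the slice width of 4, and the function names `random_level_slices`, `softmax_with_temperature`, and `sample_tile` are all hypothetical. Temperature divides the logits before the softmax, so values above 1, such as the reported range, flatten the distribution and make less likely tiles more probable.

```python
import math
import random

def random_level_slices(rows, width, n_slices, rng):
    """Return n_slices random windows of `width` consecutive columns.

    `rows` is a list of equal-length strings, one per tile row
    (a hypothetical in-memory layout, not the repository's format).
    """
    n_cols = len(rows[0])
    if width > n_cols:
        raise ValueError("slice is wider than the level")
    slices = []
    for _ in range(n_slices):
        start = rng.randrange(n_cols - width + 1)  # left edge of the window
        slices.append([row[start:start + width] for row in rows])
    return slices

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_tile(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Usage on toy data (hypothetical tiles: '-' empty, '?' block, 'X' ground).
rng = random.Random(0)
level = ["----------", "--?-------", "XXXXXXXXXX"]
batch = random_level_slices(level, width=4, n_slices=4, rng=rng)

probs_cold = softmax_with_temperature([2.0, 1.0, 0.0], 1.0)
probs_hot = softmax_with_temperature([2.0, 1.0, 0.0], 2.5)
tile = sample_tile([2.0, 1.0, 0.0], 2.5, rng)
print(probs_cold[0] > probs_hot[0])  # True: higher temperature flattens
```

Only the Python standard library is used; in practice the logits would come from the fine-tuned language model rather than a fixed list.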