GAVEL: Generating Games via Evolution and Language Models
Authors: Graham Todd, Alexander G. Padula, Matthew Stephenson, Éric Piette, Dennis J.N.J. Soemers, Julian Togelius
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate both quantitatively and qualitatively that our approach is capable of generating new and interesting games, including in regions of the potential rules space not covered by existing games in the Ludii dataset. and We show empirically that GAVEL is capable of generating playable and interesting board games that differ substantially from games encountered during training. |
| Researcher Affiliation | Academia | Graham Todd, New York University Tandon, Brooklyn, New York, USA (gdrtodd@nyu.edu); Alexander G. Padula, ETH Zurich, Zurich, Switzerland (apadula@ethz.ch); Matthew Stephenson, Flinders University, Adelaide, Australia (matthew.stephenson@flinders.edu.au); Éric Piette, UCLouvain, Louvain-la-Neuve, Belgium (eric.piette@uclouvain.be); Dennis J.N.J. Soemers, Maastricht University, Maastricht, the Netherlands (dennis.soemers@maastrichtuniversity.nl); Julian Togelius, New York University Tandon, Brooklyn, New York, USA (julian.togelius@nyu.edu) |
| Pseudocode | Yes | Algorithm 1 GAVEL Game Evaluation |
| Open Source Code | Yes | Code and data available here: https://github.com/gdrtodd/gavel |
| Open Datasets | Yes | We construct our initial game dataset out of the 1182 existing games that have been translated into the Ludii game description language (available under a Creative Commons BY-NC-ND 4.0 license). and We provide a link to a public repository that includes our code and data, including a trained model checkpoint, as a footnote at the end of the introduction and here: https://github.com/gdrtodd/gavel |
| Dataset Splits | Yes | From this reduced dataset, we hold out a set of 14 varied games (available in Appendix A) that are used to initialize the evolutionary search, with the remaining 574 games being used as our training dataset. |
| Hardware Specification | Yes | Training took approximately 40 hours to complete on a single RTX8000 GPU. and Each run lasted roughly 48 hours using a single RTX8000 GPU for inference from the Code Llama-13b model and performing evaluations in parallel with 16 CPU cores and 128GB of total memory. |
| Software Dependencies | No | The paper mentions using 'Code Llama [52] (specifically Code Llama-13b),' 'parameter-efficient fine-tuning [39] and 8-bit quantization [22],' and the 'Pyribs library [60].' However, it does not specify version numbers for the programming language (e.g., Python), deep learning framework (e.g., PyTorch), or other core software dependencies needed for replication. A hedged sketch of how such a stack might be assembled is given after the table. |
| Experiment Setup | Yes | We fine-tune the model for a single epoch with hyperparameters available in Appendix B. and Appendix B lists: Number of epochs: 1, Batch size: 1, Sequence length: 1024, Optimizer: AdamW [37], Learning rate: 3e-4, Warmup Ratio: 0.03, LoRA Alpha: 16, LoRA Dropout: 0.05, LoRA r: 64. Also, Section 5 states: For each run, we select j = 3 games and generate k = 3 mutations for each game at each step. and Section 4.2 states: We then sample from the trained Code Llama-13b model with a temperature of 1 and a top-k value of 50 to generate a new expression (see the configuration sketch after the table). |
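
Since the paper does not pin its software dependencies, the following is a minimal sketch of how the described stack (Code Llama-13b loaded with 8-bit quantization and LoRA-based parameter-efficient fine-tuning) might be assembled with the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries. The model identifier and library choices are assumptions consistent with the paper's citations, not the authors' confirmed setup; only the LoRA hyperparameters (r = 64, alpha = 16, dropout = 0.05) come from Appendix B.

```python
# Hedged sketch: assembling the fine-tuning stack described in the paper.
# Library choices and the model identifier are assumptions; the paper does
# not specify versions or exact packages.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "codellama/CodeLlama-13b-hf"  # assumed Hub ID for Code Llama-13b

# 8-bit quantization (the paper cites [22], i.e. LLM.int8 via bitsandbytes)
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quant_config,
    device_map="auto",
)

# Parameter-efficient fine-tuning with the LoRA hyperparameters from Appendix B
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```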
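The reported experiment setup translates naturally into a training configuration and a sampling call. The sketch below, continuing from the loading sketch above (reusing its `model` and `tokenizer`), wires the quoted hyperparameters from Appendix B and Section 4.2 into Hugging Face `TrainingArguments` and `generate`; the output path, prompt string, and generation-length cap are placeholders, and the Trainer/dataset plumbing is omitted.

```python
# Hedged sketch of the training and sampling settings quoted in the table.
# Only the hyperparameters are taken from the paper; everything else is assumed.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gavel-codellama-13b-lora",  # placeholder path
    num_train_epochs=1,                     # Number of epochs: 1
    per_device_train_batch_size=1,          # Batch size: 1
    learning_rate=3e-4,                     # Learning rate: 3e-4
    warmup_ratio=0.03,                      # Warmup Ratio: 0.03
    optim="adamw_torch",                    # AdamW optimizer [37]
)
# Inputs would be tokenized to the paper's sequence length of 1024.

# Sampling a new expression (Section 4.2: temperature 1, top-k 50)
prompt = "(game "  # placeholder Ludii prefix; the paper's prompt format is not shown here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_k=50,
    max_new_tokens=256,  # assumed cap, not specified in the quoted text
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```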