Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Safe Search for Stackelberg Equilibria in Extensive-Form Games

Authors: Chun Kai Ling, Noam Brown5541-5548

AAAI 2021 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We prove that our search technique is guaranteed to perform no worse than the pre-computed blueprint strategy, and empirically demonstrate that it enables approximately solving significantly larger games compared to purely offline methods. We evaluate our method on a two-stage matrix game, the classic game of Goofspiel, and a larger, general-sum variant of Leduc hold em. We demonstrate that in large games our search algorithm outperforms offline methods while requiring significantly less computation, and that this improvement increases with the size of the game.
Researcher Affiliation	Collaboration	Chun Kai Ling 1 , Noam Brown 2 1 Carnegie Mellon University 2 Facebook AI Research
Pseudocode	Yes	The pseudocode is given in Algorithm 1.
Open Source Code	Yes	Our implementation is publicly available online: https://github.com/lingchunkai/Safe Search SSE.
Open Datasets	Yes	We evaluate our method on a two-stage matrix game, the classic game of Goofspiel, and a larger, general-sum variant of Leduc hold em. (Ross 1971) and (Southey et al. 2012) are cited for Goofspiel and Leduc hold em respectively, indicating they are publicly defined games often used as benchmarks.
Dataset Splits	No	The paper describes the games and scenarios used for evaluation but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) for any of the datasets/games used.
Hardware Specification	Yes	Experiments were conducted on a Intel i7-7700K @ 4.20GHz with 4 cores and 64GB of RAM.
Software Dependencies	Yes	We use the commercial solver Gurobi (Gurobi Optimization 2021) to solve all instances of MILPs.
Experiment Setup	Yes	For full-game solving, we allowed Gurobi to run for a maximum of 1000s. For search, we allowed 100 seconds in practice, this never exceeds more than 20 seconds for any subgame. In all cases, we warm-start Gurobi with the blueprint strategy.