Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Modelling Bounded Rationality in Multi-Agent Interactions by Generalized Recursive Reasoning

Authors: Ying Wen, Yaodong Yang, Jun Wang

IJCAI 2020 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We contribute both theoretically and empirically. On the theory side, we devise the hierarchical framework of GR2 through probabilistic graphical models and prove the existence of a perfect Bayesian equilibrium. ... On the empirical side, we validate our findings on a variety of MARL benchmarks. Precisely, we first illustrate the hierarchical thinking process on the Keynes Beauty Contest, and then demonstrate significant improvements compared to state-of-the-art opponent modeling baselines on the normal-form games and the cooperative navigation benchmark.
Researcher Affiliation	Collaboration	Ying Wen (1), Yaodong Yang (1, 2), Jun Wang (1); (1) University College London, (2) Huawei Research & Development U.K.; EMAIL
Pseudocode	Yes	Algorithm 1 GR2 Soft Actor-Critic Algorithm
Open Source Code	Yes	The experiment code and appendix are available at https://github.com/ying-wen/gr2
Open Datasets	Yes	We start the experiments by elaborating how the GR2 model works on Keynes Beauty Contest, and then move onto the normal-form games that have non-trivial equilibria where common MARL methods fail to converge. Finally, we test on the navigation task that requires effective opponent modeling. ... We test the GR2 methods in more complexed Particle World environments [Lowe et al., 2017]
Dataset Splits	No	The paper mentions evaluating on benchmarks and using self-play, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or absolute counts) in the main text. It defers some details to an appendix: "We leave the detailed hyper-parameter settings and ablation studies in Appendix F due to space limit."
Hardware Specification	No	The paper does not provide any specific hardware details such as GPU models, CPU models, or cloud computing instance types used for running the experiments.
Software Dependencies	No	The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x).
Experiment Setup	Yes	We denote k as the highest level of reasoning in GR2-L/M, and adopt k = {1, 2, 3}, λ = 1.5. All results are reported with 6 random seeds.
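
For readers who want to re-run the reported configuration, the Experiment Setup row can be read as a small run grid. The following is a minimal sketch, not the authors' launcher: it assumes that each value of k is paired with each of the 6 seeds, and the seed values 0 to 5, the dictionary keys, and the function name `build_run_grid` are hypothetical. Only k = {1, 2, 3}, λ = 1.5, and the seed count come from the quoted text.

```python
from itertools import product

# Values quoted in the Experiment Setup row above.
REASONING_LEVELS = (1, 2, 3)  # k, the highest level of reasoning in GR2-L/M
LAMBDA = 1.5                  # lambda, as reported in the row
NUM_SEEDS = 6                 # "All results are reported with 6 random seeds"


def build_run_grid(levels=REASONING_LEVELS, lam=LAMBDA, num_seeds=NUM_SEEDS):
    """Return one config dict per (k, seed) pair.

    Assumption: the actual seed values are not given in the source,
    so the hypothetical seeds 0..num_seeds-1 are used here.
    """
    return [
        {"k": k, "lambda": lam, "seed": seed}
        for k, seed in product(levels, range(num_seeds))
    ]


if __name__ == "__main__":
    # Print every configuration in the grid, one per line.
    for config in build_run_grid():
        print(config)
```

Under these assumptions the grid contains 3 × 6 = 18 configurations; the hyper-parameters deferred to Appendix F are not covered by this sketch.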