Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Planning with Quantized Opponent Models
Authors: XiaoPeng Yu, Kefan Su, Zongqing Lu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical analysis of posterior concentration under uncertain opponents, and empirical results demonstrating competitive performance gains over state-of-the-art baselines in benchmark games with partial observability and adversarial dynamics. (Section 4, Experiments) |
| Researcher Affiliation | Academia | Xiaopeng Yu, Kefan Su, Zongqing Lu; School of Computer Science, Peking University |
| Pseudocode | Yes | Algorithm 1 Planning with Quantized Opponent Models |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository within its main body. |
| Open Datasets | Yes | We evaluate our method across four diverse multi-agent environments, covering a range of interaction structures, including cooperative and competitive dynamics, varying levels of partial observability, and a spectrum of strategic complexity. The Pursuit-Evasion and Predator-Prey tasks [42] are discrete grid-world domains where agents act based on local observations and reactive strategies. The Running-with-Scissors (RWS) environment [5, 53] extends rock-paper-scissors into a spatial, partially observable gridworld. The One-on-One scenario [23, 57] involves high-dimensional state and action spaces, where an attacker and a goal-keeper compete in a simplified football drill. |
| Dataset Splits | No | The paper states: "Each experiment is conducted over 300 episodes", which refers to the evaluation methodology. It describes how trajectories are collected for training latent types and payoff matrices, but it does not specify explicit training, validation, or test dataset splits in the traditional sense for these processes or for the overall experimental setup. |
| Hardware Specification | Yes | All experiments were executed on two dedicated workstations. For the PE, PP, and RWS environments we used an Intel Core i7-12700KF (12th Gen, 3.60 GHz) paired with an NVIDIA GeForce RTX 4060 Ti. The One-on-One experiments ran on a server equipped with an Intel Xeon E5-2620 v4 (2.10 GHz, 8 cores) and an NVIDIA TITAN Xp. |
| Software Dependencies | No | The paper mentions algorithms and architectures such as "PSRO [25]", "Proximal Policy Optimisation (PPO) [40]", "quantized autoencoder (VQ-VAE) [52]", and "two-layer GRU". However, it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or solvers used in the implementation. |
| Experiment Setup | Yes | Table 1: Hyperparameters (values listed as Pursuit-Evasion / Predator-Prey / RWS / One-on-One). hidden units: MLP[64, 32] / MLP[64, 32] / MLP[64, 32] / MLP[64, 32]; activation function: ReLU / ReLU / ReLU / ReLU; optimizer: Adam / Adam / Adam / Adam; learning rate: 0.0005 / 0.0005 / 0.0005 / 0.001; target update interval: 10 / 10 / 10 / 10; value discount factor: 0.99 / 0.99 / 0.99 / 0.99; GAE parameter: 0.99 / 0.99 / 0.99 / 0.99; clip parameter: 0.115 / 0.115 / 0.115 / 0.115; max grad norm: 0.5 / 0.5 / 0.5 / 0.5; \|Πi\|: 10 / 10 / 10 / 10; \|Π i\|: 50 / 50 / 50 / 50; learning rate: 0.0001 / 0.0001 / 0.0001 / 0.0001; reconstruction weight: 1 / 1 / 1 / 1; batch size: 64 / 64 / 64 / 64; latent types K: 16 / 16 / 16 / 16; temperature β: 1 / 1 / 1 / 1; exploration constant c: 2 / 2 / 2 / 2; max depth: 20 / 20 / 20 / 20; unit time: 1s / 1s / 2s / 5s |
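
For readers who want to reuse the Table 1 values quoted in the Experiment Setup row, the following is a minimal sketch of how they could be held in a per-environment configuration. It is not taken from the paper's code, which the table reports as unreleased. All field names, the `Hyperparameters` class, and the `differing_fields` helper are hypothetical. The two policy-set sizes keep neutral names because their symbols are garbled in the extracted table, and the second learning rate is named `second_learning_rate` because the excerpt does not say which component it belongs to.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Hyperparameters:
    # Defaults are the Pursuit-Evasion column of Table 1.
    hidden_units: tuple = (64, 32)        # MLP[64, 32]
    activation: str = "ReLU"
    optimizer: str = "Adam"
    learning_rate: float = 0.0005         # first "learning rate" row
    target_update_interval: int = 10
    discount: float = 0.99                # value discount factor
    gae_parameter: float = 0.99
    clip_parameter: float = 0.115
    max_grad_norm: float = 0.5
    policy_set_size_first: int = 10       # symbol garbled in the source table
    policy_set_size_second: int = 50      # symbol garbled in the source table
    second_learning_rate: float = 0.0001  # second "learning rate" row; owner not stated
    reconstruction_weight: float = 1.0
    batch_size: int = 64
    latent_types_k: int = 16
    temperature_beta: float = 1.0
    exploration_constant_c: float = 2.0
    max_depth: int = 20
    unit_time_s: float = 1.0              # "unit time", in seconds


SHARED = Hyperparameters()

# Only the first learning rate and the unit time vary across environments.
CONFIGS = {
    "Pursuit-Evasion": SHARED,
    "Predator-Prey": SHARED,
    "RWS": replace(SHARED, unit_time_s=2.0),
    "One-on-One": replace(SHARED, learning_rate=0.001, unit_time_s=5.0),
}


def differing_fields(a, b):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, differing_fields(SHARED, cfg))
```

Keeping one shared default and overriding only the differing fields makes it easy to see that, per the quoted table, the environments differ only in the first learning rate and the unit time.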