Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Evaluating Strategic Structures in Multi-Agent Inverse Reinforcement Learning
Authors: Justin Fu, Andrea Tacchetti, Julien Perolat, Yoram Bachrach
JAIR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our machinery on a classic game theory domain, a physics-based adversarial game, and a larger-scale simulated auction experiment where we show our method can extract accurate valuations for several popular auction mechanisms. |
| Researcher Affiliation | Collaboration | Justin Fu (University of California, Berkeley, Department of Electrical Engineering & Computer Science, Berkeley, CA 94720, USA); Andrea Tacchetti, Julien Perolat, Yoram Bachrach (DeepMind, 6 Pancras Square, London N1C 4AG, United Kingdom) |
| Pseudocode | Yes | Algorithm 1: Inverse Equilibrium Single-Agent Reduction (IESAR). Input: demonstration samples σ̂ ∼ π^E_{1:N}. (Markov games) Estimate π^E_{1:N} from σ̂. For player i = 1 to N: solve the single-agent IRL problem with a utility-matching IRL method, R̂_i = IRL(π^E_i \| π^E_{−i}). End for. Return R̂_{1:N}. |
| Open Source Code | No | The paper does not explicitly provide a link to source code, state that code is released, or mention code in supplementary materials. |
| Open Datasets | No | The paper uses demonstrations generated by simulating games and sampling from policies rather than publicly available datasets; no open dataset is referenced or released. |
| Dataset Splits | No | The paper describes how demonstrations were obtained by simulating games and sampling from policies, or by drawing valuations from a Gaussian distribution, but it does not specify explicit training/validation/test splits of pre-existing datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions several software components but does not specify version numbers or provide a complete list of dependencies needed to reproduce the experiments. |
| Experiment Setup | Yes | For all methods, we use a learning rate of 10⁻³ (selected via grid search between 10⁻⁴ and 10⁻¹) for both the policy and utility functions. We constrain the norm of the utility function to c = 100 (selected via grid search between 1 and 1000). In our coordinate descent procedure, we optimize the inner loop (policy optimization) for 10 steps for each outer loop (utility optimization) step. ... For the environment, we used a horizon length of 25 steps... |
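The IESAR pseudocode quoted in the table can be sketched in Python as below. This is a minimal illustration of the reduction, not the authors' implementation: `estimate_policies` and `single_agent_irl` are hypothetical callables standing in for the policy-estimation step and the utility-matching IRL method, which the paper leaves abstract.

```python
def iesar(demonstrations, n_players, estimate_policies, single_agent_irl):
    """Inverse Equilibrium Single-Agent Reduction (sketch).

    Reduces N-player inverse RL to N single-agent IRL problems by
    estimating all expert policies from the demonstrations, then
    solving each player's IRL problem with the other players'
    estimated policies held fixed.

    The two helper callables are hypothetical placeholders:
      estimate_policies(demonstrations, n_players) -> list of policies
      single_agent_irl(policy_i, other_policies)   -> reward estimate
    """
    # Estimate each expert policy pi^E_i from the demonstration samples.
    policies = estimate_policies(demonstrations, n_players)

    rewards = []
    for i in range(n_players):
        # Fixing the other players' policies turns player i's problem
        # into an ordinary single-agent IRL problem.
        others = [p for j, p in enumerate(policies) if j != i]
        rewards.append(single_agent_irl(policies[i], others))
    return rewards
```

The design point of the reduction is that any utility-matching single-agent IRL solver can be reused unchanged inside the loop.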
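The coordinate-descent schedule in the experiment-setup row (10 inner policy steps per outer utility step, utility norm capped at c = 100) can be sketched as follows. The gradient-step helpers and the norm projection are illustrative assumptions, not the paper's code; the constants mirror the reported hyperparameters.

```python
import numpy as np

def coordinate_descent(policy, utility, n_outer,
                       policy_step, utility_step,
                       inner_steps=10, lr=1e-3, norm_cap=100.0):
    """Alternating-optimization sketch matching the reported schedule:
    `inner_steps` policy updates per utility update, learning rate 1e-3
    for both, and the utility projected onto the ball ||u|| <= norm_cap.

    policy_step / utility_step are hypothetical ascent directions
    (e.g., gradient estimates) supplied by the caller.
    """
    for _ in range(n_outer):
        # Inner loop: policy optimization with the utility held fixed.
        for _ in range(inner_steps):
            policy = policy + lr * policy_step(policy, utility)
        # Outer loop: one utility-optimization step.
        utility = utility + lr * utility_step(policy, utility)
        # Enforce the norm constraint ||u|| <= c by projection.
        norm = np.linalg.norm(utility)
        if norm > norm_cap:
            utility = utility * (norm_cap / norm)
    return policy, utility
```

Projecting onto a norm ball after each outer step is one common way to realize a norm constraint; the paper does not state which mechanism it uses.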