Learning Parametric Closed-Loop Policies for Markov Potential Games

Authors: Sergio Valcarcel Macua, Javier Zazo, Santiago Zazo

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the theoretical contributions with an example by applying our approach to a noncooperative communications engineering game. We then solve the game with a deep reinforcement learning algorithm that learns policies that closely approximate an exact variational NE of the game. In this section, we show how to use the proposed MPGs framework to learn an equilibrium of a communications engineering application. As a proof of concept, we perform simulations with TRPO, approximating the policy with a neural network with 3 hidden layers of 32 neurons per layer and ReLU activation functions...
Researcher Affiliation | Collaboration | Sergio Valcarcel Macua (PROWLER.io, Cambridge, UK; sergio@prowler.io); Javier Zazo and Santiago Zazo (Information Processing and Telecommunications Center, Universidad Politécnica de Madrid, Madrid, Spain; javier.zazo.ruiz@upm.es, santiago@gaps.ssr.upm.es)
Pseudocode | No | The paper describes its methods in text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code.
Open Source Code | No | The paper does not provide a link to source code or an explicit statement that the source code for the described methodology is publicly available.
Open Datasets | No | The paper states: 'To surmount this issue, we generated 100 independent sequences of samples of h_{k,i} and δ_{k,i} for all k ∈ N and length T = 100 time steps each, and obtain two solutions with them.' This indicates self-generated data, but no access information (link, citation, or repository) for a publicly available or open dataset is provided.
Dataset Splits | No | The paper describes generating sequences for benchmarking and training a DRL agent that learns by interacting with a simulator, but it does not specify explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper mentions running simulations and training a neural network but does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for its experiments.
Software Dependencies | No | The paper mentions using the 'Trust Region Policy Optimization (TRPO) algorithm' and refers to 'CVX' for convex optimization, but it does not specify version numbers for these or any other software components.
Experiment Setup | Yes | As a proof of concept, we perform simulations with TRPO, approximating the policy with a neural network with 3 hidden layers of 32 neurons per layer and ReLU activation functions, and an output layer that is the mean of a Gaussian distribution. Each iteration of TRPO uses a batch of 4000 simulation steps (i.e., tuples of state transition, action and rewards). The step size is 0.01.
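
To make the reported experiment setup concrete, the sketch below shows a Gaussian policy with the architecture quoted above (3 hidden layers of 32 ReLU units, with the output layer producing the mean of a Gaussian). The paper does not state which framework was used; PyTorch, the placeholder state/action dimensions, and the state-independent log-std parameterization are assumptions made here purely for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianMLPPolicy(nn.Module):
    """Gaussian policy whose mean comes from a 3-layer, 32-unit ReLU MLP,
    matching the architecture reported in the paper. Framework, dimensions,
    and the log-std parameterization are assumptions, not from the paper."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 32):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # output layer: mean of the Gaussian
        )
        # State-independent log standard deviation, a common TRPO choice
        # (the paper does not specify how the variance is parameterized).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = self.mean_net(state)
        return Normal(mean, self.log_std.exp())

# Example usage with hypothetical dimensions. A TRPO loop around this policy
# would collect batches of roughly 4000 simulation steps per iteration and
# update with the reported step size of 0.01.
policy = GaussianMLPPolicy(state_dim=8, action_dim=2)
action_dist = policy(torch.randn(8))
action = action_dist.sample()
```
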