Factored Policy Gradients: Leveraging Structure for Efficient Learning in MOMDPs

Authors: Thomas Spooner, Nelson Vadori, Sumitra Ganesh

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "The final contribution is to illustrate the effectiveness of our approach over traditional estimators on two high-dimensional benchmark domains." (Introduction, Contributions section) and Section 5, Numerical Experiments, with figures showing performance (e.g., Figure 4, Figure 5). |
| Researcher Affiliation | Industry | Thomas Spooner, J. P. Morgan AI Research (thomas.spooner@jpmorgan.com); Nelson Vadori, J. P. Morgan AI Research (nelson.vadori@jpmorgan.com); Sumitra Ganesh, J. P. Morgan AI Research (sumitra.ganesh@jpmorgan.com). |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper neither states that source code for the described methodology is released nor links to a code repository. |
| Open Datasets | Yes | "In particular, we consider variants of the (3 × 3) grid network benchmark environment as originally proposed by Vinitsky et al. [51] that is provided by the outstanding Flow framework [56, 21]." |
| Dataset Splits | No | The paper does not specify explicit training, validation, and test splits (e.g., percentages or sample counts) for the datasets or environments used. It mentions averaging results over 10 random seeds, but not data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud computing instance types) used to run the experiments. |
| Software Dependencies | No | The paper mentions software such as the Flow framework and algorithms such as PPO and GAE, but it does not give version numbers for these or any other software dependencies needed for reproduction. |
| Experiment Setup | Yes | "In our experiments, the centroids were initialised with a uniform distribution, c ∼ U(−5, 5), and were held fixed between episodes. The policy was defined as an isotropic Gaussian with fixed covariance, diag(1), and initial location vector µ := 0. The parameter vector, µ, was updated at each time step, and the hyperparameters are provided in the appendix." |
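The Experiment Setup quote above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the state dimensionality `d`, the number of centroids, and the helper `sample_action` are assumptions made for illustration; only the U(−5, 5) centroid initialisation and the isotropic Gaussian policy with covariance diag(1) and initial mean 0 come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2            # illustrative dimensionality (not stated in the excerpt)
n_centroids = 4  # illustrative count (not stated in the excerpt)

# Centroids drawn once from U(-5, 5) and held fixed between episodes.
centroids = rng.uniform(-5.0, 5.0, size=(n_centroids, d))

# Isotropic Gaussian policy: fixed covariance diag(1), initial mean mu = 0.
mu = np.zeros(d)
cov = np.eye(d)

def sample_action(mu, cov, rng):
    """Sample an action from the Gaussian policy N(mu, cov)."""
    return rng.multivariate_normal(mu, cov)

action = sample_action(mu, cov, rng)
```

Per the quote, only `mu` would then be updated at each time step; `centroids` and `cov` stay fixed.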