Posterior Value Functions: Hindsight Baselines for Policy Gradient Methods

Authors: Chris Nota, Philip Thomas, Bruno C. Da Silva

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further illustrate the variance reduction properties of posterior value functions on a tabular gridworld domain with partial observability. We compared agents using learned estimates of the posterior, prior, and observation value functions as baselines for the policy gradient theorem.
Researcher Affiliation | Academia | College of Information and Computer Science, University of Massachusetts, Amherst, MA.
Pseudocode | No | The paper describes methods using mathematical equations and prose but does not include structured pseudocode or algorithm blocks with explicit labels like 'Algorithm'.
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | No | The paper uses a custom-designed 'tabular gridworld domain' shown in Figure 5, but it does not provide concrete access information (link, DOI, repository, or formal citation) for this dataset to be publicly available.
Dataset Splits | No | The paper describes an experimental setup within a gridworld environment and mentions training policies, but it does not provide specific train/validation/test dataset splits (percentages, counts, or predefined split citations), as it appears to be a simulation environment rather than a static dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions 'standard REINFORCE with baselines algorithms' but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries).
Experiment Setup | No | The paper describes the gridworld environment setup and general training approach (REINFORCE with baselines) and how results were averaged, but it defers 'full experimental details' to supplemental material and does not provide specific hyperparameters or system-level training settings in the main text.
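The Research Type and Software Dependencies rows above describe the experimental setup as standard REINFORCE with baselines on a partially observable tabular gridworld. As a rough illustration of that setup (not the authors' code), below is a minimal sketch of tabular REINFORCE with a learned value-function baseline. The environment interface, table sizes, and learning rates are placeholder assumptions; the paper's comparison concerns what the baseline is conditioned on (posterior vs. prior vs. observation value functions), not the update rule itself.

```python
# Minimal sketch, not the authors' implementation: tabular REINFORCE with a
# learned value-function baseline. `env` is a hypothetical environment exposing
# reset() -> obs and step(action) -> (obs, reward, done); sizes and learning
# rates are placeholders.

import numpy as np

rng = np.random.default_rng(0)

N_OBS, N_ACTIONS = 16, 4          # placeholder sizes for a small gridworld
GAMMA, LR_PI, LR_V = 0.99, 0.1, 0.1

theta = np.zeros((N_OBS, N_ACTIONS))   # tabular policy logits, indexed by observation
baseline = np.zeros(N_OBS)             # learned value estimate used as the baseline


def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def run_episode(env):
    """Roll out one episode under the current softmax policy."""
    obs, done, traj = env.reset(), False, []
    while not done:
        probs = softmax(theta[obs])
        action = rng.choice(N_ACTIONS, p=probs)
        next_obs, reward, done = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    return traj


def reinforce_update(traj):
    """REINFORCE with a baseline: subtract a learned value estimate from the
    Monte Carlo return before taking the policy-gradient step."""
    G = 0.0
    for obs, action, reward in reversed(traj):
        G = reward + GAMMA * G
        advantage = G - baseline[obs]
        # Gradient of log softmax policy w.r.t. the logits for this observation.
        grad_logpi = -softmax(theta[obs])
        grad_logpi[action] += 1.0
        theta[obs] += LR_PI * advantage * grad_logpi
        # Regress the baseline toward the observed return.
        baseline[obs] += LR_V * (G - baseline[obs])
```

Swapping in the paper's posterior, prior, or observation value functions would change only how the baseline estimate is conditioned; the REINFORCE update above stays the same.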