An operator view of policy gradient methods
Authors: Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the PG updates in expectation, not their stochastic variants. Thus, our presentation and analyses use the true gradient of the functions of interest. ... Empirical analysis: Although using an α-divergence is necessary to maintain πθ∗ as a stationary point, it is possible that using the KL will still lead to faster convergence early in training. We studied the effect of this family of improvement operators Iα for different choices of α in the four-room domain [22] (Figure 1). |
| Researcher Affiliation | Industry | Dibya Ghosh, Google Brain; Marlos C. Machado, Google Brain; Nicolas Le Roux, Google Brain |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methods and operators in prose and mathematical notation. |
| Open Source Code | No | The paper does not provide concrete access to source code. There is no mention of code release, no repository links, nor any statement about code in supplementary materials. |
| Open Datasets | No | The paper mentions evaluating in the "four-room domain [22]" but does not provide concrete access information (a specific link, DOI, repository name, or formal citation for the dataset itself) for a publicly available or open dataset. The four-room domain is an environment, not a pre-packaged dataset in the traditional training-data sense. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce data partitioning for training, validation, and test sets. It describes an RL environment rather than a fixed dataset with splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | The policy is parameterized by a softmax and all states share the same parameters, i.e. we use function approximation. ... One can use Iα with the KL projection step heuristically, by selecting an aggressive improvement operator Iα (low α) early in optimization, and annealing α to 1 to recover OP-REINFORCE updates asymptotically. We present results of using line search to dynamically anneal the value of α as the policy converges (details in Appendix F.1). |
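
The Experiment Setup row quotes a softmax policy whose parameters are shared across all states (function approximation) and an improvement operator Iα whose α is annealed toward 1. The sketch below is only an illustration of such a setup under stated assumptions: the function names, the feature-based parameter sharing, and the linear annealing schedule are not from the paper, which anneals α via a line search (details in its Appendix F.1).

```python
# Illustrative sketch only, not the authors' code. It shows (1) a softmax policy
# whose parameters are shared across states through state features, i.e. function
# approximation, and (2) a hypothetical linear schedule that starts with an
# aggressive improvement operator (low alpha) and anneals alpha to 1 so that the
# updates match OP-REINFORCE asymptotically. The paper instead anneals alpha with
# a line search (Appendix F.1), which is not reproduced here.
import numpy as np

def softmax_policy(theta: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Action probabilities for one state.

    theta: (num_features, num_actions) weights shared by every state.
    phi:   (num_features,) feature vector of the current state.
    """
    logits = phi @ theta
    logits = logits - logits.max()      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def linear_alpha_schedule(step: int, total_steps: int,
                          alpha_start: float = 0.0,
                          alpha_end: float = 1.0) -> float:
    """Hypothetical annealing schedule for the improvement operator I_alpha."""
    frac = min(step / max(total_steps, 1), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)

# Example: a 4-action policy over 8-dimensional state features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 4))
phi = rng.normal(size=8)
print(softmax_policy(theta, phi))        # a length-4 probability vector
print(linear_alpha_schedule(250, 1000))  # 0.25
```

Sharing one `theta` across states through features mirrors the quoted "all states share the same parameters"; a tabular policy would instead keep a separate parameter vector per state.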