An operator view of policy gradient methods
Authors: Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the PG updates in expectation, not their stochastic variants. Thus, our presentation and analyses use the true gradient of the functions of interest. ... Empirical analysis: Although using an α-divergence is necessary to maintain πθ∗ as a stationary point, it is possible that using the KL will still lead to faster convergence early in training. We studied the effect of this family of improvement operators Iα for different choices of α in the four-room domain [22] (Figure 1). |
| Researcher Affiliation | Industry | Dibya Ghosh, Google Brain; Marlos C. Machado, Google Brain; Nicolas Le Roux, Google Brain |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methods and operators in prose and mathematical notation. |
| Open Source Code | No | The paper does not provide concrete access to source code. There is no mention of code release, no repository links, nor any statement about code in supplementary materials. |
| Open Datasets | No | The paper mentions evaluating in the "four-room domain [22]" but does not provide concrete access information (a specific link, DOI, repository name, or formal citation for the dataset itself) for a publicly available or open dataset. The four-room domain is an environment, not a pre-packaged dataset in the traditional training-data sense. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce data partitioning for training, validation, and test sets. It describes an RL environment rather than a fixed dataset with splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | The policy is parameterized by a softmax and all states share the same parameters, i.e. we use function approximation. ... One can use Iα with the KL projection step heuristically, by selecting an aggressive improvement operator Iα (low α) early in optimization, and annealing α to 1 to recover OP-REINFORCE updates asymptotically. We present results of using line search to dynamically anneal the value of α as the policy converges (details in Appendix F.1). |
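
The Experiment Setup row quotes a softmax policy whose parameters are shared across all states (function approximation) and an improvement operator Iα whose α is annealed toward 1. The sketch below is only an illustration of such a setup under stated assumptions: the function names, the feature-based parameter sharing, and the linear annealing schedule are not from the paper, which anneals α via a line search (details in its Appendix F.1).

```python
# Illustrative sketch only, not the authors' code. It shows (1) a softmax policy
# whose parameters are shared across states through state features, i.e. function
# approximation, and (2) a hypothetical linear schedule that starts with an
# aggressive improvement operator (low alpha) and anneals alpha to 1 so that the
# updates match OP-REINFORCE asymptotically. The paper instead anneals alpha with
# a line search (Appendix F.1), which is not reproduced here.
import numpy as np

def softmax_policy(theta: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Action probabilities for one state.

    theta: (num_features, num_actions) weights shared by every state.
    phi:   (num_features,) feature vector of the current state.
    """
    logits = phi @ theta
    logits = logits - logits.max()      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def linear_alpha_schedule(step: int, total_steps: int,
                          alpha_start: float = 0.0,
                          alpha_end: float = 1.0) -> float:
    """Hypothetical annealing schedule for the improvement operator I_alpha."""
    frac = min(step / max(total_steps, 1), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)

# Example: a 4-action policy over 8-dimensional state features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 4))
phi = rng.normal(size=8)
print(softmax_policy(theta, phi))        # a length-4 probability vector
print(linear_alpha_schedule(250, 1000))  # 0.25
```

Sharing one `theta` across states through features mirrors the quoted "all states share the same parameters"; a tabular policy would instead keep a separate parameter vector per state.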