Bias in Natural Actor-Critic Algorithms
Authors: Philip Thomas
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(λ) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics has only global optima, so policy gradient methods are guaranteed to converge to globally optimal policies as well. |
| Researcher Affiliation | Academia | Philip S. Thomas (PTHOMAS@CS.UMASS.EDU), School of Computer Science, University of Massachusetts, Amherst, MA 01002 USA |
| Pseudocode | Yes | Algorithm 1: episodic Natural Actor-Critic 2 (eNAC2); Algorithm 2: Natural Actor-Critic using Sarsa(λ) (NAC-S(λ)) |
| Open Source Code | No | The paper does not provide CONCRETE ACCESS TO SOURCE CODE (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | No | The paper describes a synthetic MDP example (e.g., 'MDP with S = [0, 1]...'), but does not provide CONCRETE ACCESS INFORMATION (specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide SPECIFIC DATASET SPLIT INFORMATION (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide SPECIFIC HARDWARE DETAILS (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide SPECIFIC ANCILLARY SOFTWARE DETAILS (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | No | The paper mentions some conceptual settings for its illustrative example (e.g., 'We parameterize the policy with one parameter, such that a_t ∼ N(θ, σ²)... We used random restarts for all methods'), but it does not provide SPECIFIC EXPERIMENTAL SETUP DETAILS (concrete hyperparameter values, training configurations, or system-level settings) in the main text that would allow full reproduction of an experiment. |