Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
Authors: Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper complements its theoretical findings with an empirical evaluation, which it extends to multi-step MDPs, showing that baselines have a similar impact in that setting. |
| Researcher Affiliation | Collaboration | 1 Mila, McGill University; 2 Mila, University of Montreal; 3 DeepMind; 4 Amii, University of Alberta; 5 Work partially done at Google Research; 6 Google Research, Brain Team. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | No | The paper does not provide concrete access information (specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset. It refers to custom environments like a '3-arm bandit problem' and a '10x10 gridworld'. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Ternary (Harper & Weinstein, 2015)' for the plots in Figure 1, but does not provide version numbers for it or for the other key software components and libraries used in the experimental setup. |
| Experiment Setup | Yes | Figure 1: We plot 15 different trajectories of natural policy gradient with softmax parameterization, when using various baselines, on a 3-arm bandit problem with rewards (1, 0.7, 0), stepsize α = 0.025, and θ0 = (0, 3, 5). Figure 2: Learning curves for 100 runs of 200 steps, on the two-arm bandit, with baseline b = 1, for three different stepsizes α. We use a discount factor γ = 0.99. (A code sketch of the Figure 1 setting follows the table.) |
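The quoted setup is concrete enough to sketch in code. Below is a minimal, hypothetical reproduction of the Figure 1 setting: stochastic natural policy gradient with a softmax policy on the 3-arm bandit with rewards (1, 0.7, 0), stepsize α = 0.025, and θ0 = (0, 3, 5). The per-arm update θ[a] += α(r − b)/π[a] is the standard reduction of the sampled natural gradient for softmax bandits; the baseline values swept, the trajectory length, and all function names here are our assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over arm logits."""
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def natural_pg_bandit(rewards, baseline, alpha, theta0, steps, seed=0):
    """One trajectory of stochastic natural policy gradient with a softmax
    policy on a bandit with deterministic per-arm rewards.

    With a softmax parameterization, the sampled natural-gradient step
    reduces to updating only the chosen arm's logit:
        theta[a] += alpha * (r - b) / pi[a]
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(len(pi), p=pi)   # sample an arm from the policy
        r = rewards[a]
        theta[a] += alpha * (r - baseline) / pi[a]
    return softmax(theta)

# Figure 1 setting: 3-arm bandit, rewards (1, 0.7, 0),
# alpha = 0.025, theta0 = (0, 3, 5); 15 trajectories per baseline.
# The baseline values and trajectory length below are illustrative guesses.
rewards = np.array([1.0, 0.7, 0.0])
for b in (-1.0, 0.0, 1.0):
    finals = [natural_pg_bandit(rewards, b, alpha=0.025,
                                theta0=(0.0, 3.0, 5.0), steps=2000, seed=s)
              for s in range(15)]
    print(f"baseline {b:+.1f} -> mean final policy "
          f"{np.round(np.mean(finals, axis=0), 3)}")
```

The Figure 2 setting could be approximated with the same function by passing two arms, baseline b = 1, 200 steps, and each of the three stepsizes in turn; the quoted discount factor γ = 0.99 presumably matters only in the paper's multi-step experiments.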