Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization

Authors: Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs. A further empirical evaluation on multi-step MDPs shows that baselines have a similar impact in that setting.
Researcher Affiliation | Collaboration | 1 Mila, McGill University; 2 Mila, University of Montreal; 3 DeepMind; 4 Amii, University of Alberta; 5 Work partially done at Google Research; 6 Google Research, Brain Team.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | No | The paper does not provide concrete access information (a specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset. It refers to custom environments such as a '3-arm bandit problem' and a '10x10 gridworld'.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using 'Ternary (Harper & Weinstein, 2015)' for plotting Figure 1, but does not provide version numbers for this or any other key software component or library used in the experimental setup.
Experiment Setup | Yes | Figure 1: We plot 15 different trajectories of natural policy gradient with softmax parameterization, when using various baselines, on a 3-arm bandit problem with rewards (1, 0.7, 0), stepsize α = 0.025, and θ0 = (0, 3, 5). Figure 2: Learning curves for 100 runs of 200 steps on the two-arm bandit, with baseline b = 1, for three different stepsizes α. We use a discount factor γ = 0.99.
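
To make the Figure 1 setup concrete, below is a minimal Python sketch of stochastic natural policy gradient on the reported softmax 3-arm bandit. The rewards (1, 0.7, 0), stepsize α = 0.025, initialization θ0 = (0, 3, 5), and the 15 trajectories come from the row above; the baseline values {0, 0.5, 1.0}, the 1000-step horizon, deterministic arm rewards, and the random seed are illustrative assumptions, since this report does not state them. For a softmax policy the Fisher matrix F = diag(π) − ππᵀ satisfies F(e_a/π_a) = e_a − π, so the sampled natural-gradient update with baseline b adjusts only the pulled arm's logit by α(r − b)/π_a (up to the all-ones null space of F, which leaves the softmax unchanged).

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over bandit-arm logits.
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def npg_bandit_run(rewards, theta0, alpha, baseline, n_steps, rng):
    """One trajectory of stochastic natural policy gradient on a softmax bandit.

    Using F (e_a / pi_a) = e_a - pi for the softmax Fisher matrix, the sampled
    natural-gradient step only updates the pulled arm's logit:
        theta[a] += alpha * (r - b) / pi[a]
    """
    theta = np.array(theta0, dtype=float)
    probs_history = []
    for _ in range(n_steps):
        pi = softmax(theta)
        a = rng.choice(len(pi), p=pi)
        r = rewards[a]  # assumption: deterministic rewards (1, 0.7, 0)
        theta[a] += alpha * (r - baseline) / pi[a]
        probs_history.append(pi.copy())
    return np.array(probs_history)

# Reported settings: 3-arm bandit, rewards (1, 0.7, 0), alpha = 0.025,
# theta0 = (0, 3, 5), 15 trajectories per baseline. Baselines, horizon,
# and seed below are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.7, 0.0])
for b in (0.0, 0.5, 1.0):
    runs = [npg_bandit_run(rewards, (0.0, 3.0, 5.0), 0.025, b, 1000, rng)
            for _ in range(15)]
    mean_final = np.mean([run[-1][0] for run in runs])
    print(f"baseline={b}: mean final prob of best arm = {mean_final:.3f}")
```

Sweeping the baseline across the reward range in this way mirrors the kind of comparison the paper uses to argue that baselines can affect where optimization converges, not merely the variance of the gradient estimates.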