Efficient Wasserstein Natural Gradients for Reinforcement Learning

Authors: Ted Moskovitz, Michael Arbel, Ferenc Huszar, Arthur Gretton

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines." (Abstract) ... "We now test the performance of our estimators for both policy gradients (PG) and evolution strategies (ES) against their associated baseline methods." (Section 5, Experiments)
Researcher Affiliation | Academia | "Ted Moskovitz (1), Michael Arbel (1), Ferenc Huszar (1,2) & Arthur Gretton (1); (1) Gatsby Unit, UCL; (2) University of Cambridge"
Pseudocode | Yes | "Algorithm 1: Wasserstein Natural Policy Gradient" ... "Algorithm 2: Wasserstein Natural Evolution Strategies" ... "Algorithm 3: Efficient Wasserstein Natural Gradient"
Open Source Code | Yes | "Further experimental details can be found in the appendix, and our code is available online." (Footnote 1: https://github.com/tedmoskovitz/WNPG)
Open Datasets | Yes | "We first apply WNPG and BG-WNPG to challenging tasks from Open AI Gym (Brockman et al., 2016) and Roboschool (RS)."
Dataset Splits | No | The paper mentions running experiments with "5 random seeds" and tracking performance over "Timesteps", but it does not specify explicit training/validation/test dataset splits (e.g., 80/10/10).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions environments such as OpenAI Gym and Roboschool and implicitly relies on standard deep learning and automatic differentiation software, but it does not list version numbers for any dependencies (e.g., PyTorch 1.9 or Python 3.8).
Experiment Setup | Yes | "More precisely, for each task we ran a hyperparameter sweep over learning rates in the set {1e-5, 5e-5, 1e-4, 3e-4}, and used the concatenation-of-actions behavioral embedding Φ(τ) = [a0, a1, ..., aT] with the base network implementation the same as Dhariwal et al. (2017)." (Appendix D.1) ... "The WNG hyperparameters were also left the same as in Arbel et al. (2020). Specifically, the number of basis points was set as M = 5, the reduction factor was bounded in the range [0.25, 0.75], and ϵ ∈ [1e-10, 1e5]." (Appendix D.1) ... "Training is up to 4000 gradient iterations, with λ = 0.9 and β = 0.1 unless they are varied." (Appendix D.3)
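
To make the quoted setup easier to parse, below is a minimal Python sketch (assuming only NumPy; the constant names and the `behavioral_embedding` helper are illustrative and not taken from the authors' released code) of the reported hyperparameter ranges and the concatenation-of-actions embedding Φ(τ) = [a0, a1, ..., aT].

```python
import numpy as np

# Hyperparameter values as quoted from Appendix D.1 and D.3 of the paper.
# The variable names below are illustrative, not the authors' identifiers.
LEARNING_RATES = [1e-5, 5e-5, 1e-4, 3e-4]   # sweep set used for each task
WNG_NUM_BASIS_POINTS = 5                    # M = 5
WNG_REDUCTION_FACTOR_RANGE = (0.25, 0.75)
WNG_EPSILON_RANGE = (1e-10, 1e5)
MAX_GRADIENT_ITERS = 4000
LAMBDA, BETA = 0.9, 0.1


def behavioral_embedding(actions):
    """Concatenation-of-actions embedding: Phi(tau) = [a_0, a_1, ..., a_T].

    `actions` is the sequence of actions taken along one trajectory; each
    action may be a scalar or a vector, and the embedding is their flat
    concatenation.
    """
    return np.concatenate(
        [np.atleast_1d(np.asarray(a, dtype=float)) for a in actions]
    )


if __name__ == "__main__":
    # Toy usage: a 3-step trajectory with 2-dimensional actions yields a
    # 6-dimensional behavioral embedding.
    trajectory = [np.array([0.1, -0.3]), np.array([0.0, 0.5]), np.array([0.2, 0.2])]
    print(behavioral_embedding(trajectory))  # -> array of length 6
```

For a trajectory of T+1 actions with action dimension d, this embedding is simply a vector of length (T+1)·d, which is what the quoted "concatenation-of-actions" description implies for the Gym and Roboschool control tasks.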