Online Nonstochastic Model-Free Reinforcement Learning

Authors: Udaya Ghai, Arushi Gupta, Wenhan Xia, Karan Singh, Elad Hazan

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method over various standard RL benchmarks and demonstrate improved robustness. ... We empirically evaluate our method on Open AI Gym environments in Section 5.
Researcher Affiliation | Collaboration | Udaya Ghai (Amazon, ughai@amazon.com); Arushi Gupta (Princeton University & Google DeepMind, arushig@princeton.edu); Wenhan Xia (Princeton University & Google DeepMind, wxia@princeton.edu); Karan Singh (Carnegie Mellon University, karansingh@cmu.edu); Elad Hazan (Princeton University & Google DeepMind, ehazan@princeton.edu)
Pseudocode | Yes | Algorithm 1 MF-GPC (Model-Free Gradient Perturbation Controller) ... Algorithm 2 DMF-GPC (Discrete Model-Free Gradient Perturbation Controller) (see the illustrative sketch after this table)
Open Source Code | No | The paper bases its implementation on existing frameworks such as Acme and D4PG, but it does not state that its own source code is released, nor does it link to a repository for the described method.
Open Datasets | Yes | We apply the MF-GPC Algorithm 1 to various Open AI Gym [Brockman et al., 2016] environments.
Dataset Splits | No | The paper reports training durations ('1e7 steps', '1.5e7 steps') and averages results over 25 seeds, as is typical for RL, but it does not specify explicit dataset splits (e.g., percentages or counts for training, validation, and test sets) because it operates in dynamic reinforcement learning environments rather than on static, pre-partitioned datasets.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications.
Software Dependencies | No | The paper mentions software components such as 'Acme', 'DDPG', and 'D4PG', but it does not specify version numbers for these or for other relevant software dependencies such as programming languages or libraries.
Experiment Setup | Yes | We pick h = 5 and use the DDPG algorithm [Lillicrap et al., 2016] as our underlying baseline. We update the M matrices every 3 episodes instead of continuously to reduce runtime. We also apply weight decay to line 6 of Algorithm 1. Our implementation is based on the Acme implementation of D4PG. The policy and critic networks both have the default sizes of 256 × 256 × 256. We use the Acme default number of atoms as 51 for the network. We run in the distributed setting with 4 agents. The underlying learning rate of the D4PG implementation is left at 3e-04. The exploration parameter σ is tuned. (A hedged configuration sketch follows the table.)
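
For readers skimming the Pseudocode row, the following is a minimal, hypothetical sketch of how a GPC-style perturbation controller can wrap a fixed base policy: the action is the base policy's output plus a learned linear function of recent disturbance signals, and the M matrices are updated by online gradient descent with weight decay. The class, method names, disturbance estimate, and update rule here are assumptions for illustration only; the paper's Algorithm 1 (MF-GPC) specifies the actual model-free disturbance estimate and loss.

```python
import numpy as np


class GPCWrapper:
    """Illustrative GPC-style wrapper around a fixed base policy.

    All names and the update rule are assumptions sketched from the table
    above, not the paper's exact Algorithm 1 (MF-GPC).
    """

    def __init__(self, base_policy, state_dim, action_dim,
                 h=5, lr=1e-3, weight_decay=1e-4):
        self.base_policy = base_policy                 # e.g., a trained DDPG actor
        self.h = h                                     # history length (paper reports h = 5)
        self.M = np.zeros((h, action_dim, state_dim))  # learned perturbation matrices
        self.lr = lr
        self.weight_decay = weight_decay
        self.w_history = [np.zeros(state_dim) for _ in range(h)]

    def act(self, state):
        # Base action plus a learned linear function of recent disturbance signals.
        u = self.base_policy(state)
        for i in range(self.h):
            u = u + self.M[i] @ self.w_history[i]
        return u

    def observe_disturbance(self, w):
        # w is a disturbance-like residual; how it is estimated without a model
        # is specified by the paper, not by this sketch.
        self.w_history = [w] + self.w_history[:-1]

    def update(self, grad_M):
        # Online gradient step on M with weight decay, in the spirit of the
        # "weight decay applied to line 6 of Algorithm 1" noted above.
        self.M = (1.0 - self.lr * self.weight_decay) * self.M - self.lr * grad_M
```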
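
The Experiment Setup row lists concrete hyperparameters; the snippet below simply collects them in one place as a hedged configuration sketch. The key names and dictionary layout are ours, not taken from the authors' code.

```python
# Hyperparameters as reported in the Experiment Setup row above; key names
# and structure are illustrative, not from the authors' implementation.
experiment_config = {
    "history_length_h": 5,                     # h = 5
    "base_algorithm": "DDPG (Acme D4PG implementation)",
    "M_update_interval_episodes": 3,           # M matrices updated every 3 episodes
    "policy_network_sizes": (256, 256, 256),
    "critic_network_sizes": (256, 256, 256),
    "num_atoms": 51,                           # Acme D4PG default
    "num_distributed_agents": 4,
    "base_learning_rate": 3e-4,                # D4PG default, left unchanged
    "exploration_sigma": "tuned",              # σ tuned per environment, per the paper
    "weight_decay": "applied to line 6 of Algorithm 1",
}
```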