Online Nonstochastic Model-Free Reinforcement Learning
Authors: Udaya Ghai, Arushi Gupta, Wenhan Xia, Karan Singh, Elad Hazan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method over various standard RL benchmarks and demonstrate improved robustness. ... We empirically evaluate our method on OpenAI Gym environments in Section 5. |
| Researcher Affiliation | Collaboration | Udaya Ghai (Amazon) ughai@amazon.com; Arushi Gupta (Princeton University & Google DeepMind) arushig@princeton.edu; Wenhan Xia (Princeton University & Google DeepMind) wxia@princeton.edu; Karan Singh (Carnegie Mellon University) karansingh@cmu.edu; Elad Hazan (Princeton University & Google DeepMind) ehazan@princeton.edu |
| Pseudocode | Yes | Algorithm 1 MF-GPC (Model-Free Gradient Perturbation Controller) ... Algorithm 2 DMF-GPC (Discrete Model-Free Gradient Perturbation Controller) (a hedged sketch of the GPC-style perturbation structure follows the table) |
| Open Source Code | No | The paper mentions building its implementation on existing frameworks such as Acme and its D4PG agent, but it does not explicitly state that its own source code is released, nor does it link to a repository for the described method. |
| Open Datasets | Yes | We apply the MF-GPC Algorithm 1 to various OpenAI Gym [Brockman et al., 2016] environments. |
| Dataset Splits | No | The paper describes training duration ('1e7 steps', '1.5e7 steps') and averaging results over seeds ('25 seeds'), typical for RL. However, it does not specify explicit dataset splits (e.g., percentages or counts for training, validation, and test sets) as it operates in dynamic reinforcement learning environments rather than on static pre-partitioned datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions software components like 'Acme', 'DDPG', and 'D4PG', but it does not specify version numbers for these or other relevant software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | We pick h = 5 and use the DDPG algorithm [Lillicrap et al., 2016] as our underlying baseline. We update the M matrices every 3 episodes instead of continuously to reduce runtime. We also apply weight decay to line 6 of Algorithm 1. Our implementation is based on the Acme implementation of D4PG. The policy and critic networks both have the default sizes of (256, 256, 256). We use the Acme default of 51 atoms for the network. We run in the distributed setting with 4 agents. The underlying learning rate of the D4PG implementation is left at 3e-04. The exploration parameter, σ, is tuned. (A hypothetical configuration sketch follows the table.) |
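
The Pseudocode row refers to Algorithms 1 and 2 (MF-GPC and DMF-GPC), which are not reproduced here. The following is a minimal sketch of the generic GPC-style perturbation structure that the naming suggests, assuming actions of the form u_t = π_base(s_t) + Σ_{i=1}^{h} M_i w_{t-i} with the M matrices updated by online gradient descent. The class name, the disturbance-like signal `w`, and the gradient inputs are illustrative placeholders, not the paper's definitions.

```python
import numpy as np


class GPCPerturbation:
    """Hypothetical sketch of a GPC-style action perturbation.

    Not the paper's Algorithm 1: the signals in `history` and the loss
    used to produce the gradients passed to `update` are placeholders;
    the actual MF-GPC procedure is defined in the paper.
    """

    def __init__(self, action_dim, signal_dim, h=5, lr=1e-3, weight_decay=1e-4):
        self.h = h                # history length (the paper reports h = 5)
        self.lr = lr
        self.weight_decay = weight_decay
        # One matrix M_i per history step, mapping past signals to action offsets.
        self.M = [np.zeros((action_dim, signal_dim)) for _ in range(h)]
        self.history = [np.zeros(signal_dim) for _ in range(h)]

    def act(self, base_action):
        # u_t = pi_base(s_t) + sum_i M_i w_{t-i}
        offset = sum(M_i @ w_i for M_i, w_i in zip(self.M, self.history))
        return base_action + offset

    def observe(self, w_t):
        # Push the newest disturbance-like signal into the length-h buffer.
        self.history = [w_t] + self.history[:-1]

    def update(self, grads):
        # One online gradient step per M_i, with weight decay
        # (the paper mentions applying weight decay to this update).
        for i, g in enumerate(grads):
            self.M[i] = (1 - self.lr * self.weight_decay) * self.M[i] - self.lr * g
```

Keeping the perturbation logic separate from the baseline policy is one way such a controller could wrap an existing DDPG/D4PG agent, consistent with the paper's description of building on the Acme D4PG implementation.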
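
For the Experiment Setup row, the reported hyperparameters can be collected into a single configuration object. This is a hypothetical transcription for readability; the field names are invented here and do not come from any released code (the Open Source Code row notes that none is provided).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MFGPCExperimentConfig:
    """Hypothetical summary of the setup reported in the paper.

    Field names are illustrative; they are not taken from any released code.
    """
    history_length: int = 5                    # h = 5
    baseline_agent: str = "D4PG"               # Acme implementation of D4PG (DDPG-style baseline)
    policy_layers: Tuple[int, ...] = (256, 256, 256)
    critic_layers: Tuple[int, ...] = (256, 256, 256)
    num_atoms: int = 51                        # Acme default for the distributional critic
    num_distributed_actors: int = 4            # distributed setting with 4 agents
    learning_rate: float = 3e-4                # underlying D4PG learning rate
    m_update_period_episodes: int = 3          # M matrices updated every 3 episodes
    apply_weight_decay: bool = True            # applied to the M update (line 6 of Algorithm 1)
    exploration_sigma: Optional[float] = None  # tuned per environment
    num_seeds: int = 25                        # results averaged over 25 seeds
```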