Nonlinear Distributional Gradient Temporal-Difference Learning
Authors: Chao Qu, Shie Mannor, Huan Xu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experimental results): We first use a simple grid-world experiment to test the convergence of distributional GTD2 and distributional TDC in the off-policy setting, and then assess their empirical performance against their non-distributional counterparts, GTD2 and TDC. In particular, a simple cartpole problem is used, where several policy-evaluation steps are run to obtain an accurate estimate of the value function before each policy-improvement step. To apply distributional GTD2 or distributional TDC, a neural network approximates the distribution function F_θ((s, a), z). In both experiments this network has one hidden layer with 50 units; the inputs are state-action pairs and the output is a softmax over 30 atoms of the distribution, i.e., the softmax has 30 outputs. (A hedged sketch of this network appears after the table.) |
| Researcher Affiliation | Collaboration | 1Ant Financial Services Group, Hangzhou, China 2Faculty of Electrical Engineering, Technion, Haifa, Israel 3Alibaba Group, Seattle, USA 4H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech, Atlanta, USA. |
| Pseudocode | Yes | Algorithm 1 Distributional GTD2 for policy evaluation |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | In particular, we test the algorithm in the CartPole-v0 and LunarLander-v2 environments from OpenAI Gym (Brockman et al., 2016) and in ViZDoom (Kempka et al., 2016). (A minimal environment-interaction snippet appears after the table.) |
| Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits, such as percentages or sample counts. |
| Hardware Specification | No | The paper mentions using neural networks and discussing computational complexity, but it does not provide any specific details about the hardware used for experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using OpenAI Gym, VizDoom, and Adam optimizer, but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In the experiment, the discount factor is γ = 0.9. Since the reward is bounded in [0, 1] in the cartpole problem, distributional GTD2 and distributional TDC use Vmin = 0 and Vmax = 10... In the control problem (cartpole), an ϵ-greedy policy over the expected action values is used, where ϵ starts at 0.1 and decreases gradually to 0.02... For the CartPole and LunarLander environments, distributional Greedy-GQ and C51 are implemented with a two-hidden-layer neural network of 64 hidden units per layer and ReLU activations to approximate the value distribution; the output is a softmax over 40 units approximating the probability atoms. Adam with learning rate 5e-4 is used to train the agent. In the ViZDoom experiment, the first three layers are convolutional, followed by a dense layer, all with ReLU activations; the output is a softmax over 50 units, with Vmin = 10 and Vmax = 20. (A hedged configuration sketch appears after the table.) |
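
The "Research Type" row describes the value-distribution network F_θ((s, a), z) used for policy evaluation: one hidden layer with 50 units and a softmax output over 30 atoms. Below is a minimal PyTorch sketch under those stated sizes; the class name `DistributionNet`, the hidden activation, and the encoding of the state-action input are assumptions not specified in the quoted text.

```python
import torch
import torch.nn as nn

class DistributionNet(nn.Module):
    """Sketch of a network approximating the return distribution F_theta((s, a), z)
    as a categorical distribution over a fixed set of atoms (details are assumptions)."""

    def __init__(self, state_action_dim, hidden=50, n_atoms=30):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_action_dim, hidden),  # one hidden layer, 50 units
            nn.ReLU(),                            # activation assumed, not stated in the quote
            nn.Linear(hidden, n_atoms),           # 30 atoms in the support
        )

    def forward(self, state_action):
        # Probability mass assigned to each atom of the return distribution.
        return torch.softmax(self.body(state_action), dim=-1)
```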
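
The "Open Datasets" row points to standard OpenAI Gym environments rather than fixed datasets. A minimal interaction loop, assuming the classic Gym API of that era (`reset()` returns an observation; `step()` returns observation, reward, done flag, and info), is sketched below; the random action is only a placeholder for the learned policy.

```python
import gym

env = gym.make("CartPole-v0")  # LunarLander-v2 is created analogously
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # placeholder for the agent's action
    obs, reward, done, info = env.step(action)  # classic (pre-0.26) Gym step signature
env.close()
```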
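
The "Experiment Setup" row fixes the distributional support and the exploration schedule. The sketch below collects those quoted hyperparameters (γ = 0.9, Vmin/Vmax, 40 atoms, ϵ from 0.1 to 0.02); the linear decay and its horizon `DECAY_STEPS` are assumptions, since the paper only says ϵ "decreases gradually".

```python
import numpy as np

GAMMA = 0.9
V_MIN, V_MAX = 0.0, 10.0                    # cartpole setting; ViZDoom uses 10 and 20
N_ATOMS = 40                                # 40 atoms in the control experiments, 30 earlier
ATOMS = np.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed support of the categorical distribution

EPS_START, EPS_END = 0.1, 0.02
DECAY_STEPS = 10_000                        # assumed horizon; not given in the paper

def epsilon(step):
    # Linear decay from EPS_START to EPS_END over DECAY_STEPS environment steps (assumed schedule).
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def expected_action_value(atom_probs):
    # The epsilon-greedy policy acts on the expected value of the categorical
    # distribution: E[Z] = sum_i p_i * z_i.
    return float(np.dot(atom_probs, ATOMS))
```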