Nonlinear Distributional Gradient Temporal-Difference Learning
Authors: Chao Qu, Shie Mannor, Huan Xu
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experimental results): We first use a simple grid-world experiment to test the convergence of distributional GTD2 and distributional TDC in the off-policy setting, and then assess their empirical performance against their non-distributional counterparts, GTD2 and TDC. In particular, a simple cartpole problem is used, where several policy-evaluation steps are run to obtain an accurate estimate of the value function before each policy-improvement step. To apply distributional GTD2 or distributional TDC, a neural network approximates the distribution function F_θ((s, a), z). In both experiments this network has one hidden layer with 50 units; the inputs are state-action pairs and the output is a softmax over 30 atoms of the distribution, i.e., the softmax has 30 outputs. (A hedged sketch of this network appears after the table.) |
| Researcher Affiliation | Collaboration | 1Ant Financial Services Group, Hangzhou, China 2Faculty of Electrical Engineering, Technion, Haifa, Israel 3Alibaba Group, Seattle, USA 4H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech, Atlanta, USA. |
| Pseudocode | Yes | Algorithm 1 Distributional GTD2 for policy evaluation |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | In particular, we test the algorithm in the CartPole-v0 and LunarLander-v2 environments from OpenAI Gym (Brockman et al., 2016) and in ViZDoom (Kempka et al., 2016). (A minimal environment-interaction snippet appears after the table.) |
| Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits, such as percentages or sample counts. |
| Hardware Specification | No | The paper mentions using neural networks and discussing computational complexity, but it does not provide any specific details about the hardware used for experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using OpenAI Gym, VizDoom, and Adam optimizer, but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In the experiment, the discount factor is γ = 0.9. Since the reward is bounded in [0, 1] in the cartpole problem, distributional GTD2 and distributional TDC use Vmin = 0 and Vmax = 10... In the control problem (cartpole), an ϵ-greedy policy over the expected action values is used, where ϵ starts at 0.1 and decreases gradually to 0.02... For the CartPole and LunarLander environments, distributional Greedy-GQ and C51 are implemented with a two-hidden-layer neural network of 64 hidden units per layer and ReLU activations to approximate the value distribution; the output is a softmax over 40 units approximating the probability atoms. Adam with learning rate 5e-4 is used to train the agent. In the ViZDoom experiment, the first three layers are convolutional, followed by a dense layer, all with ReLU activations; the output is a softmax over 50 units, with Vmin = 10 and Vmax = 20. (A hedged configuration sketch appears after the table.) |
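
The "Research Type" row describes the value-distribution network F_θ((s, a), z) used for policy evaluation: one hidden layer with 50 units and a softmax output over 30 atoms. Below is a minimal PyTorch sketch under those stated sizes; the class name `DistributionNet`, the hidden activation, and the encoding of the state-action input are assumptions not specified in the quoted text.

```python
import torch
import torch.nn as nn

class DistributionNet(nn.Module):
    """Sketch of a network approximating the return distribution F_theta((s, a), z)
    as a categorical distribution over a fixed set of atoms (details are assumptions)."""

    def __init__(self, state_action_dim, hidden=50, n_atoms=30):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_action_dim, hidden),  # one hidden layer, 50 units
            nn.ReLU(),                            # activation assumed, not stated in the quote
            nn.Linear(hidden, n_atoms),           # 30 atoms in the support
        )

    def forward(self, state_action):
        # Probability mass assigned to each atom of the return distribution.
        return torch.softmax(self.body(state_action), dim=-1)
```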
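
The "Open Datasets" row points to standard OpenAI Gym environments rather than fixed datasets. A minimal interaction loop, assuming the classic Gym API of that era (`reset()` returns an observation; `step()` returns observation, reward, done flag, and info), is sketched below; the random action is only a placeholder for the learned policy.

```python
import gym

env = gym.make("CartPole-v0")  # LunarLander-v2 is created analogously
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # placeholder for the agent's action
    obs, reward, done, info = env.step(action)  # classic (pre-0.26) Gym step signature
env.close()
```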
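
The "Experiment Setup" row fixes the distributional support and the exploration schedule. The sketch below collects those quoted hyperparameters (γ = 0.9, Vmin/Vmax, 40 atoms, ϵ from 0.1 to 0.02); the linear decay and its horizon `DECAY_STEPS` are assumptions, since the paper only says ϵ "decreases gradually".

```python
import numpy as np

GAMMA = 0.9
V_MIN, V_MAX = 0.0, 10.0                    # cartpole setting; ViZDoom uses 10 and 20
N_ATOMS = 40                                # 40 atoms in the control experiments, 30 earlier
ATOMS = np.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed support of the categorical distribution

EPS_START, EPS_END = 0.1, 0.02
DECAY_STEPS = 10_000                        # assumed horizon; not given in the paper

def epsilon(step):
    # Linear decay from EPS_START to EPS_END over DECAY_STEPS environment steps (assumed schedule).
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def expected_action_value(atom_probs):
    # The epsilon-greedy policy acts on the expected value of the categorical
    # distribution: E[Z] = sum_i p_i * z_i.
    return float(np.dot(atom_probs, ATOMS))
```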