Log-normality and Skewness of Estimated State/Action Values in Reinforcement Learning

Authors: Liangpeng Zhang, Ke Tang, Xin Yao

Venue: NeurIPS 2017

Reproducibility assessment (each entry lists the variable, the result, and the supporting LLM response):

Research Type: Experimental
"In this section, we present our empirical results on the skewness of estimated values. There are two purposes in these experiments: (a) to demonstrate how substantial the harm of the skewness can be; (b) to see the improvement provided by collecting more observations, as mentioned in Section 4.1. We conducted experiments in chain MDPs shown in Figure 4."

Researcher Affiliation: Academia
"Liangpeng Zhang (1,2), Ke Tang (3,1), and Xin Yao (3,2). (1) School of Computer Science and Technology, University of Science and Technology of China; (2) University of Birmingham, U.K.; (3) Shenzhen Key Lab of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, China."

Pseudocode: No
The paper does not contain any pseudocode or algorithm blocks.

Open Source Code: No
No explicit statement about providing open-source code or a link to a code repository is found in the paper.

Open Datasets: No
"We conducted experiments in chain MDPs shown in Figure 4. ... We also conducted experiments in the complex maze domain [26] in the same manner as above. The maze used is given in Figure 6 (a)."

Dataset Splits: No
"In each run of experiment, m observations were collected for each state-action pair, resulting in a data set of size $2mn$." (A worked instance of this count is given below.)

Hardware Specification: No
No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are mentioned in the paper.

Software Dependencies: No
The paper discusses algorithms but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks).

Experiment Setup: Yes
"The empirical and theoretical distributions of estimated state value $\hat{V}^{\pi^+}(s_1)$ with $m = 200$, $n = 20$, $p = 0.1$, $r_G = 10^6$ in 1000 runs is shown in Figure 5 (a). ... under discount factor $\gamma = 0.9$. ... Figure 6 (b) shows the empirical distribution of estimated value $\hat{V}^{\pi}(s_{\mathrm{start}}, \text{no flag})$ under $\gamma = 0.9$ and $m = 10$ in 1000 runs." (A sketch of this kind of estimation experiment follows below.)
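
As a quick sanity check on the $2mn$ data-set size quoted under Dataset Splits, plugging in the chain-MDP values from the Experiment Setup row (assuming they refer to the same chain experiment: $m = 200$ observations per state-action pair, $n = 20$ states, and the factor 2 for two actions per state) gives:

```latex
% Worked instance of the data-set size 2mn quoted under "Dataset Splits",
% using the chain-MDP values m = 200 and n = 20 from "Experiment Setup".
\[
  2mn = 2 \times 200 \times 20 = 8000 \ \text{observations per run}
\]
```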
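
To make the quoted experiment setup concrete, below is a minimal Python sketch (not the authors' code) of the kind of Monte Carlo value-estimation experiment the paper describes: estimate $\hat{V}^{\pi}(s_1)$ as the average of $m$ sampled returns, repeat over many runs, and inspect the skewness of the resulting estimates. The values $m = 200$, $n = 20$, $p = 0.1$, $r_G = 10^6$, $\gamma = 0.9$ and the 1000-run protocol are taken from the quoted text; the chain dynamics (slip back with probability $p$, goal reward at state $n$, zero rewards elsewhere), the truncation length, and the helper names `sample_return` and `estimate_value` are assumptions for illustration, not the Figure 4 MDP itself. The sketch also simplifies data collection: it samples $m$ returns from the start state only, rather than building the full $2mn$ data set of per-state-action observations.

```python
import random
import statistics

def sample_return(n=20, p=0.1, r_G=1e6, gamma=0.9, max_steps=1000):
    """Sample one discounted return from state s_1 of an assumed n-state chain.

    Dynamics (an assumption, not the paper's Figure 4 MDP): the agent tries to
    move one state toward the goal each step; with probability p it slips back
    one state instead. Reaching state n yields the goal reward r_G and ends the
    episode; all other rewards are zero.
    """
    s, discount = 1, 1.0
    for _ in range(max_steps):
        if s == n:
            return discount * r_G
        s = s + 1 if random.random() > p else max(1, s - 1)
        discount *= gamma
    return 0.0  # episode truncated without reaching the goal

def estimate_value(m=200, **mdp_kwargs):
    """Monte Carlo estimate of V(s_1): the average of m sampled returns."""
    return sum(sample_return(**mdp_kwargs) for _ in range(m)) / m

if __name__ == "__main__":
    # Repeat the estimation 1000 times (the run count quoted from the paper)
    # and summarise the distribution of the resulting estimates.
    runs = [estimate_value(m=200, n=20, p=0.1, r_G=1e6, gamma=0.9)
            for _ in range(1000)]
    mu, sd = statistics.mean(runs), statistics.pstdev(runs)
    skew = sum((x - mu) ** 3 for x in runs) / (len(runs) * sd ** 3)
    print(f"mean={mu:.4g}  median={statistics.median(runs):.4g}  skewness={skew:.3f}")
```

Running it prints the mean, median, and sample skewness of the 1000 estimates; a histogram of `runs` would be the analogue of the empirical distributions the paper plots in Figures 5 and 6 (b).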