State-Augmentation Transformations for Risk-Sensitive Reinforcement Learning

Authors: Shuai Ma, Jia Yuan Yu (pp. 4512-4519)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The averaged empirical return distribution is from a simulation repeated 50 times with a time horizon of 1000, with the error region representing the standard deviations of the means along the return axis. (A sketch of this averaging procedure follows this table.)
Researcher Affiliation | Academia | Shuai Ma, Jia Yuan Yu, Concordia Institute for Information Systems Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, Quebec, Canada H3G 1M8; m_shua@encs.concordia.ca, jiayuan.yu@concordia.ca
Pseudocode | Yes | Algorithm 1: State-Transition Transformation (for Case 0)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or direct links to a code repository.
Open Datasets | No | The paper constructs an MDP for a single-product stochastic inventory-control problem based on (Puterman 1994, Section 3.2.1) and defines its parameters (W, c(x), m(x), M, f(x), demand probabilities), but this is a described problem setup rather than a publicly available dataset with concrete access information.
Dataset Splits | No | The paper mentions "a simulation repeated 50 times with a time horizon 1000" but does not specify explicit training, validation, or test dataset splits or cross-validation details for a given dataset.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, specific solvers).
Experiment Setup | Yes | We set the parameters as follows. The fixed order cost W = 4, the variable order cost c(x) = 2x, the maintenance fee m(x) = x, the warehouse capacity M = 2, and the price f(x) = 8x. The probabilities of demands are P(Dt = 0) = 0.25, P(Dt = 1) = 0.5, P(Dt = 2) = 0.25, respectively. The initial distribution µ(0) = 1. [...] Now we set γ = 0.95 and compare the two return distributions [...]. The averaged empirical return distribution is from a simulation repeated 50 times with a time horizon of 1000. (A simulation sketch for this setup follows this table.)
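The Experiment Setup row lists everything needed to re-implement the single-product inventory-control MDP, so a minimal Python simulation sketch is given below. The transition and reward conventions (fixed-plus-variable order cost for nonzero orders, holding fee on post-order stock, revenue on units sold) are assumptions taken from the Puterman (1994, Section 3.2.1) model cited in the Open Datasets row, and the order-up-to-capacity policy is a hypothetical placeholder, since the excerpt does not state which policy the paper evaluates.

import numpy as np

# Parameters quoted in the Experiment Setup row.
W = 4                               # fixed order cost
M = 2                               # warehouse capacity
GAMMA = 0.95                        # discount factor
DEMAND_VALUES = np.array([0, 1, 2])
DEMAND_PROBS = np.array([0.25, 0.5, 0.25])

def c(x): return 2 * x              # variable order cost
def m(x): return x                  # maintenance (holding) fee
def f(x): return 8 * x              # price (revenue for x units sold)

def step(rng, inventory, order):
    # Transition/reward conventions assumed from Puterman (1994, Sec. 3.2.1):
    # order, pay costs, observe demand, sell what the stock allows.
    stock = inventory + order                           # must stay <= M
    demand = rng.choice(DEMAND_VALUES, p=DEMAND_PROBS)
    sold = min(stock, demand)
    order_cost = W + c(order) if order > 0 else 0
    reward = f(sold) - order_cost - m(stock)
    return stock - sold, reward

def simulate_return(rng, policy, horizon=1000, init_inventory=0):
    # mu(0) = 1 in the setup means the chain starts with empty inventory.
    inventory, ret, discount = init_inventory, 0.0, 1.0
    for _ in range(horizon):
        order = policy(inventory)
        inventory, reward = step(rng, inventory, order)
        ret += discount * reward
        discount *= GAMMA
    return ret

def order_up_to_capacity(inventory):
    # Hypothetical placeholder policy: always refill to capacity M.
    return M - inventory

rng = np.random.default_rng(0)
returns = [simulate_return(rng, order_up_to_capacity) for _ in range(50)]
print(np.mean(returns), np.std(returns))

The final two lines mirror the repeated-simulation protocol quoted in the Research Type row (50 repeats, horizon 1000); only the policy and random seed are illustrative choices.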
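For the averaged empirical return distribution described in the Research Type row, one plausible reading is that each of the 50 repeats yields a batch of return samples, the per-repeat histograms are averaged bin-wise, and the per-bin standard deviation gives the error region along the return axis. The sketch below implements that reading; the Gaussian placeholder batches stand in for returns produced by the simulator above and are not the paper's data.

import numpy as np

def averaged_return_distribution(return_batches, bins=30):
    # Pool all samples to fix a common set of bin edges, then compute one
    # normalized histogram per repeat and average the histograms bin-wise.
    pooled = np.concatenate(return_batches)
    edges = np.histogram_bin_edges(pooled, bins=bins)
    per_repeat = np.stack([np.histogram(batch, bins=edges, density=True)[0]
                           for batch in return_batches])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, per_repeat.mean(axis=0), per_repeat.std(axis=0)

# Placeholder input: 50 repeats of 200 return samples each, drawn from a
# Gaussian purely for illustration; in a reproduction these would come from
# the inventory-control simulator sketched above.
rng = np.random.default_rng(1)
batches = [rng.normal(loc=60.0, scale=5.0, size=200) for _ in range(50)]
centers, mean_density, std_density = averaged_return_distribution(batches)
# mean_density is the averaged empirical return distribution; plotting it
# against centers with a band of +/- std_density gives the error region.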