Reinforcement Learning When All Actions Are Not Always Available
Authors: Yash Chandak, Georgios Theocharous, Blossom Metevier, Philip S. Thomas (pp. 3381–3388)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performances of SAS-PG, SAS-NPG, and SAS-Q-learning? (A hedged sketch of the SAS-PG update appears after this table.) |
| Researcher Affiliation | Collaboration | Yash Chandak,1 Georgios Theocharous,2 Blossom Metevier,1 Philip S. Thomas1 1University of Massachusetts Amherst, 2Adobe Research |
| Pseudocode | Yes | Pseudo-code for the SAS policy gradient algorithm is provided in Algorithm 1. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by Boutilier et al. (2018). |
| Dataset Splits | No | The paper does not specify exact training, validation, and test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., libraries, frameworks, or programming languages) that were used for the experiments. |
| Experiment Setup | No | The paper mentions learning-rate hyper-parameters (ηϖ, ηω, ηθ, and ηλ) and states that λ is initialized to 0.5, but it does not provide specific values for the learning rates or for other hyperparameters such as batch size or number of epochs. (A hedged sketch of the λ-weighted baseline appears after this table.) |
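
For readers reconstructing the method from the description above, the following is a minimal sketch of the core idea behind SAS policy gradient: the policy is a softmax renormalized over only the actions available at the current step, and a REINFORCE-style update uses the log-gradient of that restricted distribution. This is not the authors' code (none was released); the function names `sas_action_probs` and `sas_pg_step`, the linear-softmax parameterization, and the learning rate are illustrative assumptions.

```python
import numpy as np

def sas_action_probs(logits, available):
    """Softmax restricted to the currently available action subset.

    logits:    (num_actions,) unnormalized scores from the policy
    available: (num_actions,) boolean mask of actions offered this step
    """
    masked = np.where(available, logits, -np.inf)  # unavailable actions get zero probability
    masked -= masked.max()                         # shift for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

def sas_pg_step(theta, features, action, available, advantage, lr=1e-2):
    """One REINFORCE-style update for a linear-softmax SAS policy (sketch).

    theta:    (num_actions, num_features) policy weights
    features: (num_features,) state features
    """
    probs = sas_action_probs(theta @ features, available)
    # grad of log pi(a | s, available set) for a linear-softmax policy:
    # d/d theta_b = (1[a == b] - pi_b) * features
    grad = -np.outer(probs, features)
    grad[action] += features
    return theta + lr * advantage * grad
```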
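Similarly, the adaptive variance reduction described in question (b) weights two baselines; the only detail reported here is that λ starts at 0.5. The sketch below is an assumption about one plausible scheme: a convex combination b = λ·b1 + (1−λ)·b2 with λ adapted by gradient descent on the squared advantage as a variance proxy. The exact update rule is not given in this section, and `blended_advantage` and `update_lambda` are hypothetical names.

```python
import numpy as np

def blended_advantage(ret, b1, b2, lam):
    """Advantage using a convex combination of two baselines (sketch)."""
    return ret - (lam * b1 + (1.0 - lam) * b2)

def update_lambda(lam, ret, b1, b2, lr=1e-3):
    """Nudge lambda to shrink the squared advantage (assumed update, not the paper's).

    d/dlam (ret - lam*b1 - (1-lam)*b2)^2 = -2 * adv * (b1 - b2),
    so gradient descent adds +2 * lr * adv * (b1 - b2).
    """
    adv = blended_advantage(ret, b1, b2, lam)
    lam = lam + lr * 2.0 * adv * (b1 - b2)
    return float(np.clip(lam, 0.0, 1.0))  # keep the blend a convex combination

lam = 0.5  # initial value reported in the paper
```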