Reinforcement Learning When All Actions Are Not Always Available

Authors: Yash Chandak, Georgios Theocharous, Blossom Metevier, Philip S. Thomas

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performance of SAS-PG, SAS-NPG, and SAS-Q-learning? (Illustrative sketches of the SAS policy-gradient step and the baseline weighting follow the table.)
Researcher Affiliation | Collaboration | Yash Chandak (1), Georgios Theocharous (2), Blossom Metevier (1), Philip S. Thomas (1); (1) University of Massachusetts Amherst, (2) Adobe Research
Pseudocode | Yes | This algorithm is presented in Algorithm 2. Pseudo-code for the SAS policy gradient algorithm is provided in Algorithm 1. (A hedged, runnable sketch of the SAS-PG update is given after the table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology or a link to a code repository.
Open Datasets | Yes | This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by Boutilier et al. (2018).
Dataset Splits | No | The paper does not specify exact training, validation, and test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., libraries, frameworks, or programming languages) that were used for the experiments.
Experiment Setup | No | The paper mentions learning-rate hyper-parameters (ηϖ, ηω, ηθ, and ηλ) and states that the initial λ values are 0.5, but it does not provide specific values for the learning rates or for other hyperparameters such as batch size or number of epochs. (A placeholder configuration making this explicit is sketched after the table.)
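For readers who want a concrete picture of the pseudocode discussed in the table, the following is a minimal sketch of a stochastic-action-set (SAS) policy-gradient step, assuming a linear-softmax policy that is renormalized over whichever actions happen to be available at each time step. It is an illustration under stated assumptions, not the authors' implementation; every identifier here (sample_action_set, action_probs, reinforce_update) is hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_action_set(n_actions, p_avail):
        # Each action is independently available with probability p_avail;
        # resample until at least one action is available.
        while True:
            mask = rng.random(n_actions) < p_avail
            if mask.any():
                return mask

    def action_probs(theta, state_feats, mask):
        # Softmax restricted to the available actions; unavailable actions
        # receive probability zero via a -inf logit.
        logits = state_feats @ theta          # one logit per action
        logits = np.where(mask, logits, -np.inf)
        z = np.exp(logits - logits[mask].max())
        return z / z.sum()

    def reinforce_update(theta, trajectory, eta_theta=1e-2, gamma=0.99):
        # REINFORCE-style update in which the score function is taken with
        # respect to the availability-renormalized policy.
        g = 0.0
        grad = np.zeros_like(theta)
        for state_feats, mask, action, reward in reversed(trajectory):
            g = reward + gamma * g            # return from this step onward
            probs = action_probs(theta, state_feats, mask)
            one_hot = np.zeros(len(probs))
            one_hot[action] = 1.0
            # grad of log pi(a | s, available set) for a linear-softmax policy
            grad += np.outer(state_feats, one_hot - probs) * g
        return theta + eta_theta * grad

The key departure from an ordinary policy-gradient step is that the softmax is renormalized over the sampled action set, so the score term (one_hot - probs) assigns no probability mass, and hence no gradient signal, to unavailable actions.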
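Question (b) in the Research Type row concerns how the adaptive variance-reduction technique weights two baselines over training. The paper states that λ is initialized at 0.5 and trained with learning rate ηλ; the specific update rule below, a gradient step on the mean squared advantage as a crude variance proxy, is an assumption made purely for illustration.

    import numpy as np

    def combined_baseline(lmbda, b1, b2):
        # Convex combination of two baseline estimates for the same states.
        return lmbda * b1 + (1.0 - lmbda) * b2

    def update_lambda(lmbda, returns, b1, b2, eta_lambda=1e-3):
        # Assumed update: descend the mean squared advantage, a stand-in
        # for minimizing the variance of the gradient estimate.
        adv = returns - combined_baseline(lmbda, b1, b2)
        grad = np.mean(-2.0 * adv * (b1 - b2))  # d/d lambda of mean(adv**2)
        return float(np.clip(lmbda - eta_lambda * grad, 0.0, 1.0))

    lmbda = 0.5  # initial value reported in the paper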
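Finally, since the Experiment Setup row notes that the learning rates are named but never given numeric values, a faithful re-implementation can only pin down what the paper actually states. The placeholder record below makes that explicit; None marks every value the paper leaves unreported.

    # Hypothetical configuration record; only lambda_init is stated in the paper.
    hyperparams = {
        "eta_varpi": None,    # η_ϖ, learning rate (value not reported)
        "eta_omega": None,    # η_ω, learning rate (value not reported)
        "eta_theta": None,    # η_θ, policy learning rate (value not reported)
        "eta_lambda": None,   # η_λ, baseline-weight learning rate (not reported)
        "lambda_init": 0.5,   # initial λ, stated in the paper
    }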