Reinforcement Learning When All Actions Are Not Always Available
Authors: Yash Chandak, Georgios Theocharous, Blossom Metevier, Philip S. Thomas (pp. 3381–3388)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we use empirical studies to answer the following three questions: (a) How do our proposed algorithms, SAS policy gradient (SAS-PG) and SAS natural policy gradient (SAS-NPG), compare to the prior method SAS-Q-learning? (b) How does our adaptive variance reduction technique weight the two baselines over the training duration? (c) What impact does the probability of action availability have on the performances of SAS-PG, SAS-NPG, and SAS-Q-learning? (A hedged sketch of the SAS-PG update appears after this table.) |
| Researcher Affiliation | Collaboration | Yash Chandak,1 Georgios Theocharous,2 Blossom Metevier,1 Philip S. Thomas1 1University of Massachusetts Amherst, 2Adobe Research |
| Pseudocode | Yes | Pseudo-code for the SAS policy gradient algorithm is provided in Algorithm 1. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | This task models the problem of finding shortest paths in San Francisco, and was first presented with stochastic actions by Boutilier et al. (2018). |
| Dataset Splits | No | The paper does not specify exact training, validation, and test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., libraries, frameworks, or programming languages) that were used for the experiments. |
| Experiment Setup | No | The paper mentions learning-rate hyper-parameters (ηϖ, ηω, ηθ, and ηλ) and states that λ is initialized to 0.5, but it does not provide specific values for the learning rates or for other hyperparameters such as batch size or number of epochs. (A hedged sketch of the λ-weighted baseline appears after this table.) |
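
For readers reconstructing the method from the description above, the following is a minimal sketch of the core idea behind SAS policy gradient: the policy is a softmax renormalized over only the actions available at the current step, and a REINFORCE-style update uses the log-gradient of that restricted distribution. This is not the authors' code (none was released); the function names `sas_action_probs` and `sas_pg_step`, the linear-softmax parameterization, and the learning rate are illustrative assumptions.

```python
import numpy as np

def sas_action_probs(logits, available):
    """Softmax restricted to the currently available action subset.

    logits:    (num_actions,) unnormalized scores from the policy
    available: (num_actions,) boolean mask of actions offered this step
    """
    masked = np.where(available, logits, -np.inf)  # unavailable actions get zero probability
    masked -= masked.max()                         # shift for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

def sas_pg_step(theta, features, action, available, advantage, lr=1e-2):
    """One REINFORCE-style update for a linear-softmax SAS policy (sketch).

    theta:    (num_actions, num_features) policy weights
    features: (num_features,) state features
    """
    probs = sas_action_probs(theta @ features, available)
    # grad of log pi(a | s, available set) for a linear-softmax policy:
    # d/d theta_b = (1[a == b] - pi_b) * features
    grad = -np.outer(probs, features)
    grad[action] += features
    return theta + lr * advantage * grad
```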
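Similarly, the adaptive variance reduction described in question (b) weights two baselines; the only detail reported here is that λ starts at 0.5. The sketch below is an assumption about one plausible scheme: a convex combination b = λ·b1 + (1−λ)·b2 with λ adapted by gradient descent on the squared advantage as a variance proxy. The exact update rule is not given in this section, and `blended_advantage` and `update_lambda` are hypothetical names.

```python
import numpy as np

def blended_advantage(ret, b1, b2, lam):
    """Advantage using a convex combination of two baselines (sketch)."""
    return ret - (lam * b1 + (1.0 - lam) * b2)

def update_lambda(lam, ret, b1, b2, lr=1e-3):
    """Nudge lambda to shrink the squared advantage (assumed update, not the paper's).

    d/dlam (ret - lam*b1 - (1-lam)*b2)^2 = -2 * adv * (b1 - b2),
    so gradient descent adds +2 * lr * adv * (b1 - b2).
    """
    adv = blended_advantage(ret, b1, b2, lam)
    lam = lam + lr * 2.0 * adv * (b1 - b2)
    return float(np.clip(lam, 0.0, 1.0))  # keep the blend a convex combination

lam = 0.5  # initial value reported in the paper
```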