Multi-Step Generalized Policy Improvement by Leveraging Approximate Models

Authors: Lucas N. Alegre, Ana Bazzan, Ann Nowé, Bruno C. da Silva

Venue: NeurIPS 2023

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "We evaluate h-GPI on challenging tabular and continuous-state problems under value function approximation and show that it consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation errors. ... We conduct tabular and deep RL experiments in three different domains to evaluate the effectiveness of h-GPI as a method for zero-shot policy transfer."
Researcher Affiliation: Academia. Lucas N. Alegre (1, 2), Ana L. C. Bazzan (1), Ann Nowé (2), Bruno C. da Silva (3); 1: Institute of Informatics, Federal University of Rio Grande do Sul; 2: Artificial Intelligence Lab, Vrije Universiteit Brussel; 3: University of Massachusetts. Contact: {lnalegre,bazzan}@inf.ufrgs.br, ann.nowe@vub.be, bsilva@cs.umass.edu.
Pseudocode: Yes. "Algorithm 1: h-GPI with Successor Features... Algorithm 2: Forward-Pass... Algorithm 3: h-GPI with SFs and FB-DP"
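The listed algorithms build on generalized policy improvement (GPI) over successor features (SFs). As a point of reference, the sketch below shows the standard zero-shot GPI action-selection step that h-GPI generalizes; the array layout, function name, and use of NumPy are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def gpi_action(psi, w):
        # Zero-shot GPI action selection with successor features.
        # psi : array of shape (n_policies, n_actions, d), the successor
        #       features psi^{pi_i}(s, a) of each training policy at state s.
        # w   : array of shape (d,), the reward weights of the test task.
        q = psi @ w                            # Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w
        return int(np.argmax(q.max(axis=0)))   # max over policies, then argmax over actions

As the pseudocode names suggest, h-GPI (Algorithms 1-3) extends this one-step maximization with an h-step lookahead through a learned approximate model before bootstrapping with the same SF-based GPI value estimates at the leaves.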
Open Source Code: Yes. "The code necessary to reproduce our results is available in the Supplemental Material."
Open Datasets: Yes. "First, we consider the tabular Four Room domain (Barreto et al., 2017). ... The second domain is Reacher (Alegre et al., 2022a)... Finally, we extend the Fetch Push domain (Plappert et al., 2018)..."
Dataset Splits: Yes. "The probabilistic neural networks used to approximate the model were trained with early stopping based on a holdout validation subset with instances drawn from the experience buffer B, as commonly done when training such networks (Chua et al., 2018; Janner et al., 2019). ... We follow previous works (Borsa et al., 2019; Kim et al., 2022) and use as training tasks the weight vectors that form the standard basis of R^d in all three domains. In Four Room, we use 32 weight vectors equally spaced on the weight simplex {w | Σ_{i=1}^d w_i = 1, w_i ≥ 0} as test tasks. For Reacher and Fetch Push, we follow Kim et al. (2022) and use weight vectors defined by {-1, 1}^d as test tasks."
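The train/test task split quoted above is fully specified by its weight vectors, so it can be restated in a few lines. In the sketch below the helper names are hypothetical, and the simplex test weights are sampled from a Dirichlet distribution as a stand-in for the paper's equally spaced grid.

    import itertools
    import numpy as np

    def training_weights(d):
        # Training tasks: the standard basis of R^d (one reward feature per task).
        return np.eye(d)

    def simplex_test_weights(d, n=32, seed=0):
        # Four Room test tasks: n weight vectors on the simplex
        # {w | sum_i w_i = 1, w_i >= 0}; sampled here rather than equally spaced.
        rng = np.random.default_rng(seed)
        return rng.dirichlet(np.ones(d), size=n)

    def corner_test_weights(d):
        # Reacher / Fetch Push test tasks: all 2^d sign patterns in {-1, 1}^d.
        return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))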
Hardware Specification: Yes. "For the tabular experiments, we used an Intel i7-8700 CPU @ 3.20GHz computer with 32GB of RAM. For the experiments involving the function approximation setting, we used computers with NVIDIA A100-PCIE-40GB GPUs."
Software Dependencies: No. The paper mentions using the "JAX library (Bradbury et al., 2018)" but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup: Yes. "In the Four Room experiments, the SFs of each policy were learned similarly as in Alegre et al. (2022a), using Q-learning and 5 Dyna updates per time step. Each policy was trained for 10^6 time steps using a learning rate of 0.1 and epsilon-greedy exploration with a probability of selecting a random action linearly decayed from 1 to 0.05 during half of the training period. ... We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3 × 10^-4, and mini-batches of size 256."
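The exploration and planning schedule in the quoted setup is easy to restate in code. The sketch below is a minimal illustration assuming a tabular scalar Q-table (the paper learns SFs instead) and a deterministic transition memory as the model; the discount factor is an assumption, while the learning rate of 0.1, the 5 Dyna updates per step, and the linear epsilon decay over half of the 10^6 training steps come from the quoted text.

    import numpy as np

    def epsilon_schedule(step, total_steps=1_000_000, eps_start=1.0, eps_end=0.05):
        # Linearly decay epsilon over the first half of training, then hold it constant.
        frac = min(step / (total_steps // 2), 1.0)
        return eps_start + frac * (eps_end - eps_start)

    def q_learning_dyna_step(Q, model, s, a, r, s_next,
                             alpha=0.1, gamma=0.95, n_dyna=5, rng=None):
        # One real Q-learning update followed by n_dyna simulated (Dyna) updates
        # replayed from previously observed transitions. gamma = 0.95 is an assumption.
        rng = rng or np.random.default_rng()
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        model[(s, a)] = (r, s_next)            # remember the observed transition
        keys = list(model)
        for _ in range(n_dyna):                # planning updates from the learned model
            ps, pa = keys[rng.integers(len(keys))]
            pr, ps_next = model[(ps, pa)]
            Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
        return Q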