Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization

Authors: Pierre Liotet, Francesco Vidaich, Alberto Maria Metelli, Marcello Restelli (pp. 7525-7533)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments, including water resource management and trading." and "After having revised the literature (Section 5), we provide an experimental evaluation on realistic domains, including a trading and water resource management, in comparison with state-of-the-art baselines (Section 6)."
Researcher Affiliation | Academia | "Pierre Liotet¹, Francesco Vidaich², Alberto Maria Metelli¹, Marcello Restelli¹; ¹Politecnico di Milano, ²University of Padova"
Pseudocode | Yes | "Algorithm 1: Lifelong learning with POLIS"
Open Source Code | Yes | "The code is available at https://github.com/pierresdr/polis."
Open Datasets | No | "We consider three datasets of historical data, 2009-2012, 2013-2016, and 2017-2020; each period having a little more than 1000 data points." and "The inflow (e.g., rain) is the non-stationary process and the agent has obviously no impact on it, thus satisfying assumption 6.1. The mean inflow follows one of either 3 profiles given in Appendix C.2." (The paper describes the data sources but does not provide concrete access information such as a direct link, DOI, or formal citation to a publicly available version of the exact datasets used.)
Dataset Splits | Yes | "In the first, we select the best performing hyperparameters from the dataset 2009-2012 and evaluate the selection on the other two datasets. In the second approach, we both select the hyperparameters and evaluate on the last two datasets."
Hardware Specification | No | The paper does not provide any specific hardware details for the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "For all tasks, we set γ = ω = 1. We consider a particular subclass of non-stationary environments, frequently encountered in practice." and "α is set to 500 and we consider a target period of 500 steps." and "α is set to 1000 in order to include enough years of past data in the estimator. We provide results for a target period of 500 steps." and "but is now training its hyper-policy every few steps (50 in all experiments) for a given number of gradient steps (100 in all experiments)."
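
The Experiment Setup row pins down a concrete lifelong-training schedule: keep a window of roughly α past steps, and retrain the hyper-policy every 50 environment steps for 100 gradient steps. Below is a minimal sketch of that schedule only, assuming hypothetical names (`HyperPolicy`, `update_hyper_policy`, `env_step`) and a toy update rule in place of the paper's MIS-regularized objective; it is not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the lifelong schedule quoted above, under the assumptions
# stated in the lead-in. All class/function names are hypothetical placeholders.
import numpy as np


class HyperPolicy:
    """Stand-in hyper-policy: maps a time index to (noisy) policy parameters."""

    def __init__(self, param_dim: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.weights = np.zeros(param_dim)

    def sample_policy_params(self, t: int) -> np.ndarray:
        # POLIS's hyper-policy is stochastic and time-conditioned; this
        # placeholder just perturbs a fixed parameter vector.
        return self.weights + 0.1 * self.rng.standard_normal(self.weights.shape)


def update_hyper_policy(hp: HyperPolicy, history: list, gradient_steps: int) -> None:
    # Placeholder for the paper's objective (a future-performance estimate built
    # with multiple importance sampling plus a variance penalty). Here it only
    # nudges the mean toward the best-rewarded parameters seen in the window.
    _, best_theta, _ = max(history, key=lambda item: item[2])
    for _ in range(gradient_steps):
        hp.weights += 0.01 * (best_theta - hp.weights)


def lifelong_loop(env_step, total_steps=5000, alpha=500, update_every=50, gradient_steps=100):
    """Act at every step; retrain the hyper-policy every `update_every` steps."""
    hp = HyperPolicy(param_dim=4)
    history = []  # (t, policy_params, reward) tuples
    for t in range(total_steps):
        theta = hp.sample_policy_params(t)
        reward = env_step(theta, t)
        history.append((t, theta, reward))
        history = history[-alpha:]  # keep only the last alpha steps (cf. α above)
        if (t + 1) % update_every == 0:
            update_hyper_policy(hp, history, gradient_steps)
    return hp


if __name__ == "__main__":
    # Toy non-stationary reward whose optimum drifts over time, for illustration only.
    drift = lambda theta, t: -float(np.sum((theta - np.sin(t / 500.0)) ** 2))
    lifelong_loop(drift, total_steps=2000)
```

The part of POLIS that matters is precisely what the stubbed `update_hyper_policy` omits: the estimator of future performance built from the last α steps via multiple importance sampling, regularized by a penalty on its variance.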