Pareto Policy Pool for Model-based Offline Reinforcement Learning

Authors: Yijun Yang, Jing Jiang, Tianyi Zhou, Jie Ma, Yuhui Shi

ICLR 2022

Reproducibility checklist. Each entry lists the variable, the assessed result, and the supporting evidence extracted from the paper.

Research Type: Experimental
Evidence: "On the D4RL benchmark for offline RL, P3 substantially outperforms several recent baseline methods over multiple tasks, especially when the quality of pre-collected experiences is low. ... This section aims to answer the following questions by evaluating P3 with other offline RL methods on the datasets from the D4RL Gym benchmark (Fu et al., 2020)."

Researcher Affiliation: Academia
Evidence: "1 Australian Artificial Intelligence Institute, University of Technology Sydney; 2 University of Washington, Seattle; 3 University of Maryland, College Park; 4 Department of Computer Science and Engineering, Southern University of Science and Technology"

Pseudocode: Yes
Evidence: "Algorithm 1: Pareto policy pool (P3) for model-based offline RL; Algorithm 2: A two-stage method for solving constrained bi-objective optimization; Algorithm 3: Fitted Q evaluation (FQE) for Pareto policy selection" (a minimal FQE sketch appears after this checklist)

Open Source Code: Yes
Evidence: "Code is available at https://github.com/OverEuro/P3."

Open Datasets: Yes
Evidence: "We evaluate P3 and compare it with several state-of-the-art offline RL methods on the standard D4RL Gym benchmark (Fu et al., 2020)." (a dataset-loading snippet appears after this checklist)

Dataset Splits: Yes
Evidence: "We train an ensemble of N models and pick the best K models based on their prediction error on a hold-out set. ... D4RL Gym Datasets. D4RL is a widely-used benchmark for evaluating offline RL algorithms. It provides a variety of environments, tasks, and corresponding datasets..." (an elite-selection sketch appears after this checklist)

Hardware Specification: No
The paper does not specify the hardware used for the experiments, such as CPU models, GPU models, or cloud computing instances.

Software Dependencies: No
The paper names software components such as MLP, Adam, and OpenAI's ES, but does not give version numbers or other library dependencies needed for reproduction.

Experiment Setup: Yes
Evidence: "Table 3: Hyperparameters of environment model for D4RL Gym experiments" (e.g., number of models/elites 7/5, learning rate 10^-4); "Table 4: Hyperparameters of P3 for D4RL Gym experiments" (e.g., policy network MLP(32, 32), horizon length H 1000, number of reference vectors n 5). These values are transcribed into a config dict after this checklist.

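The FQE procedure named in Algorithm 3 is a standard off-policy evaluation method: regress Q toward r + gamma * Q_target(s', pi(s')) on the offline transitions, then rank candidate policies by their estimated values. A minimal PyTorch sketch follows; the network size, iteration count, and target-sync period are assumptions rather than the paper's settings, and `policy` stands for any callable mapping observation tensors to action tensors.

    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Small Q-network; the architecture is an assumption, not the paper's."""
        def __init__(self, obs_dim, act_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

    def fqe_score(policy, obs, act, rew, next_obs, done,
                  iters=500, sync_every=50, gamma=0.99, lr=1e-4):
        """Estimate Q^pi by iterated regression: Q <- r + gamma * Q_tgt(s', pi(s'))."""
        q = QNet(obs.shape[-1], act.shape[-1])
        q_tgt = QNet(obs.shape[-1], act.shape[-1])
        q_tgt.load_state_dict(q.state_dict())
        opt = torch.optim.Adam(q.parameters(), lr=lr)
        for it in range(iters):
            with torch.no_grad():
                target = rew + gamma * (1.0 - done) * q_tgt(next_obs, policy(next_obs))
            loss = ((q(obs, act) - target) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
            if (it + 1) % sync_every == 0:
                q_tgt.load_state_dict(q.state_dict())  # periodic target sync
        with torch.no_grad():
            # Score the policy by its mean estimated value over dataset states.
            return q(obs, policy(obs)).mean().item()

    # Selection over the Pareto pool:
    # best = max(pool, key=lambda pi: fqe_score(pi, obs, act, rew, next_obs, done))
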
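The D4RL datasets referenced in the Open Datasets entry are loaded through D4RL's Gym wrapper. A minimal sketch, assuming the standard d4rl package is installed (the task name is illustrative):

    import gym
    import d4rl  # registers the D4RL environments with Gym on import

    env = gym.make('halfcheetah-medium-v2')  # any D4RL Gym task name
    data = d4rl.qlearning_dataset(env)       # dict of transition arrays
    print(data['observations'].shape, data['actions'].shape,
          data['rewards'].shape, data['next_observations'].shape,
          data['terminals'].shape)

    # D4RL's normalized score (0 = random policy, 100 = expert policy) is the
    # metric used to compare offline RL methods across tasks.
    print(env.get_normalized_score(3000.0) * 100)
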
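The hold-out model selection described in the Dataset Splits evidence reduces to ranking the N trained dynamics models by validation error and keeping the top K. A sketch under the assumption that each model is a callable m(obs, act) returning the predicted next state (a hypothetical interface):

    import torch

    def select_elites(models, holdout_obs, holdout_act, holdout_next_obs, k=5):
        """Keep the k ensemble members with the lowest one-step MSE on the hold-out split."""
        with torch.no_grad():
            errs = [((m(holdout_obs, holdout_act) - holdout_next_obs) ** 2).mean().item()
                    for m in models]
        elite_idx = sorted(range(len(models)), key=lambda i: errs[i])[:k]
        return [models[i] for i in elite_idx]
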
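For reference, the hyperparameters quoted from Tables 3 and 4 in the Experiment Setup entry, collected into one config dict (the key names are illustrative, and the paper's tables contain further settings not repeated here):

    P3_CONFIG = {
        # environment model (Table 3)
        "num_models": 7,         # ensemble size N
        "num_elites": 5,         # elites K kept by hold-out error
        "model_lr": 1e-4,        # learning rate 10^-4
        # policy search (Table 4)
        "policy_net": (32, 32),  # MLP hidden layer sizes
        "horizon_H": 1000,       # rollout horizon length
        "num_ref_vectors": 5,    # number of reference vectors n
    }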