Percentile Criterion Optimization in Offline Reinforcement Learning

Authors: Cyrus Cousins, Elita Lobo, Marek Petrik, Yair Zick

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Supporting excerpts: "Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies."; "Finally, we empirically demonstrate the efficacy of our framework in three domains (Section 5)."; "Table 1 summarizes the performance of the VaR framework and the baselines for confidence level δ = 0.05."
Researcher Affiliation | Academia | Elita A. Lobo, Department of Computer Science, University of Massachusetts Amherst (elobo@umass.edu); Cyrus Cousins, Department of Computer Science, University of Massachusetts Amherst (cbcousins@umass.edu); Yair Zick, Department of Computer Science, University of Massachusetts Amherst (yzick@umass.edu); Marek Petrik, Department of Computer Science, University of New Hampshire (mpetrik@cs.unh.edu)
Pseudocode | Yes | "Algorithm 3.1: Generalized VaR Value Iteration Algorithm." (An illustrative sketch of a VaR-style value-iteration backup appears after the table.)
Open Source Code | Yes | "The code and datasets are made available at https://github.com/elitalobo/VaRFramework.git."
Open Datasets | No | "Riverswim: The Riverswim MDP [46] consists of five states and two actions. ... Population Growth Model: The Population Growth MDP [25] models the population growth of pests... Inventory Management: The Inventory Management MDP [56] models the classical inventory management problem... For each domain in our experiments, we sample a dataset D consisting of n tuples of the form (s, a, r, s′), corresponding to the state s, the action taken a, the reward r and the next state s′. We construct a posterior distribution over the models using D, assuming Dirichlet priors over the model parameters. Using MCMC sampling, we construct two datasets D1 and D2 containing M and K transition probability models, respectively. We construct L train datasets by randomly sampling 80% of the models from D1 each time. We use D2 as our test dataset." The paper cites previous works for the MDP models, but it does not provide concrete access information (link, DOI, or specific dataset file citation) for the actual data used in the experiments, beyond stating that the data were sampled.
Dataset Splits | No | "We construct L train datasets by randomly sampling 80% of the models from D1 each time. We use D2 as our test dataset." While this describes how the train and test datasets are constructed, no separate validation split is specified.
Hardware Specification | No | "We concurrently ran all experiments using 120 threads on a CPU swarm cluster with 2GB memory per thread." This mentions a CPU swarm cluster and per-thread memory but lacks specific CPU models, total RAM, or GPU information.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | "Implementation details: For each domain in our experiments, we sample a dataset D consisting of n tuples of the form (s, a, r, s′), corresponding to the state s, the action taken a, the reward r and the next state s′. We construct a posterior distribution over the models using D, assuming Dirichlet priors over the model parameters. Using MCMC sampling, we construct two datasets D1 and D2 containing M and K transition probability models, respectively. We construct L train datasets by randomly sampling 80% of the models from D1 each time. We use D2 as our test dataset." Specific hyperparameters are also reported, e.g. for the Riverswim domain: number of train models per dataset (M) = 500; number of test models (K) = 700; number of train datasets (L) = 10. (A sketch of this dataset-construction procedure appears after the table.)
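
To make the Pseudocode entry above more concrete, below is a minimal sketch of a percentile-criterion (value-at-risk) value iteration over sampled transition models. It is not a reproduction of the paper's Algorithm 3.1: the function name, array shapes, and the per-state-action quantile backup are assumptions chosen only to illustrate the idea of optimizing a δ-quantile of Bellman targets across posterior model samples.

```python
import numpy as np

def var_value_iteration(models, rewards, delta=0.05, gamma=0.95,
                        n_iters=500, tol=1e-6):
    """Illustrative percentile-criterion (VaR) value iteration.

    models  : (K, S, A, S) array of K sampled transition models.
    rewards : (S, A) array of expected immediate rewards.
    delta   : confidence level; each backup uses the delta-quantile of the
              Bellman targets computed across the K sampled models.
    """
    K, S, A, _ = models.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Bellman targets under every sampled model: shape (K, S, A).
        targets = rewards[None, :, :] + gamma * np.einsum("ksat,t->ksa", models, V)
        # Percentile criterion: the delta-quantile over models is a value
        # the return meets or exceeds with probability roughly 1 - delta.
        q = np.quantile(targets, delta, axis=0)   # shape (S, A)
        V_new = q.max(axis=1)                     # greedy maximization over actions
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, q.argmax(axis=1)                    # value function and greedy policy
```

Taking the δ-quantile independently at each state-action pair is the standard, conservative percentile-criterion relaxation; the paper's contribution is a tighter, less conservative construction, which this sketch does not attempt to capture.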
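
The Experiment Setup row describes how the model datasets are built: Dirichlet priors over transition parameters, posterior samples split into D1 (M = 500 train models) and D2 (K = 700 test models), and L = 10 train datasets drawn as random 80% subsets of D1. A minimal sketch of that procedure follows. The paper uses MCMC to sample from the posterior; because a Dirichlet prior over multinomial transition probabilities is conjugate, the sketch draws directly from a Dirichlet posterior instead, and the count array and helper name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_models(counts, n_models, prior=1.0):
    """Draw transition models from a Dirichlet posterior.

    counts : (S, A, S) array of observed transition counts from dataset D.
    Returns an array of shape (n_models, S, A, S).
    """
    S, A, _ = counts.shape
    models = np.empty((n_models, S, A, S))
    for s in range(S):
        for a in range(A):
            models[:, s, a, :] = rng.dirichlet(counts[s, a] + prior, size=n_models)
    return models

# Hyperparameters quoted for the Riverswim domain; the count data are stand-ins.
M, K, L = 500, 700, 10
S, A = 5, 2                                    # Riverswim: five states, two actions
counts = rng.integers(0, 20, size=(S, A, S))   # placeholder for counts derived from D

D1 = sample_models(counts, M)                  # posterior samples used for training
D2 = sample_models(counts, K)                  # held-out posterior samples for testing

# L train datasets, each a random 80% subset of the models in D1.
train_datasets = [
    D1[rng.choice(M, size=int(0.8 * M), replace=False)] for _ in range(L)
]
```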