Percentile Criterion Optimization in Offline Reinforcement Learning
Authors: Cyrus Cousins, Elita Lobo, Marek Petrik, Yair Zick
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies." ... "Finally, we empirically demonstrate the efficacy of our framework in three domains (Section 5)." ... "Table 1 summarizes the performance of the VaR framework and the baselines for confidence level δ = 0.05." |
| Researcher Affiliation | Academia | Elita A. Lobo, Department of Computer Science, University of Massachusetts Amherst (elobo@umass.edu); Cyrus Cousins, Department of Computer Science, University of Massachusetts Amherst (cbcousins@umass.edu); Yair Zick, Department of Computer Science, University of Massachusetts Amherst (yzick@umass.edu); Marek Petrik, Department of Computer Science, University of New Hampshire (mpetrik@cs.unh.edu) |
| Pseudocode | Yes | Algorithm 3.1: Generalized VaR Value Iteration Algorithm (a generic percentile-criterion value-iteration sketch is given after this table) |
| Open Source Code | Yes | The code and datasets are made available at https://github.com/elitalobo/VaRFramework.git. |
| Open Datasets | No | Riverswim: The Riverswim MDP [46] consists of five states and two actions. ... Population Growth Model: The Population Growth MDP [25] models the population growth of pests... Inventory Management: The Inventory Management MDP [56] models the classical inventory management problem... For each domain in our experiments, we sample a dataset D consisting of n tuples of the form {s, a, r, s'}, corresponding to the state s, the action taken a, the reward r, and the next state s'. We construct a posterior distribution over the models using D, assuming Dirichlet priors over the model parameters. Using MCMC sampling, we construct two datasets D1 and D2 containing M and K transition probability models, respectively. We construct L train datasets by randomly sampling 80% of the models from D1 each time. We use D2 as our test dataset. The paper cites prior work for the MDP models but does not provide concrete access information (link, DOI, or specific dataset file citation) for the actual data used in the experiments beyond stating that it was sampled. |
| Dataset Splits | No | We construct L train datasets by randomly sampling 80% of the models from D1 each time. We use D2 as our test dataset. While it describes train/test dataset construction, it does not specify a separate validation split. |
| Hardware Specification | No | We concurrently ran all experiments using 120 threads on a CPU swarm cluster with 2GB memory per thread. This mentions a CPU swarm cluster and per-thread memory but lacks specific CPU models, total RAM, or GPU information. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | Implementation details: For each domain in our experiments, we sample a dataset D consisting of n tuples of the form {s, a, r, s'}, corresponding to the state s, the action taken a, the reward r, and the next state s'. We construct a posterior distribution over the models using D, assuming Dirichlet priors over the model parameters. Using MCMC sampling, we construct two datasets D1 and D2 containing M and K transition probability models, respectively. We construct L train datasets by randomly sampling 80% of the models from D1 each time. We use D2 as our test dataset. Specific hyperparameters are also reported, e.g., for the Riverswim domain: number of train models per dataset (M) = 500, number of test models (K) = 700, number of train datasets (L) = 10. A sketch of this data-generation procedure is given after this table. |
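
The data-generation procedure quoted under Experiment Setup (offline tuples, a Dirichlet posterior over transition models, and train/test model sets) can be summarized in a short sketch. The code below is illustrative only: the function and variable names (`sample_posterior_models`, `n_states`, `D1`, `D2`, etc.) are our own, not taken from the authors' repository, and we sample the independent Dirichlet posteriors directly rather than via MCMC as the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior_models(transitions, n_states, n_actions, n_models, prior=1.0):
    """Sample transition-probability models from Dirichlet posteriors.

    `transitions` is the offline dataset D of (s, a, r, s') tuples. The prior
    is a symmetric Dirichlet over next-state probabilities for each (s, a).
    """
    counts = np.full((n_states, n_actions, n_states), prior)
    for s, a, _r, s_next in transitions:
        counts[s, a, s_next] += 1.0
    # Each model is one posterior draw P[s, a, :] ~ Dirichlet(counts[s, a, :]).
    models = np.empty((n_models, n_states, n_actions, n_states))
    for m in range(n_models):
        for s in range(n_states):
            for a in range(n_actions):
                models[m, s, a] = rng.dirichlet(counts[s, a])
    return models

# Reported Riverswim hyperparameters: M = 500 train models (D1), K = 700 test
# models (D2), L = 10 train datasets, each a random 80% subsample of D1.
M, K, L = 500, 700, 10
# D1 = sample_posterior_models(D, n_states=5, n_actions=2, n_models=M)
# D2 = sample_posterior_models(D, n_states=5, n_actions=2, n_models=K)
# train_sets = [D1[rng.choice(M, size=int(0.8 * M), replace=False)] for _ in range(L)]
```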
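
The pseudocode row refers to the paper's Algorithm 3.1 (Generalized VaR Value Iteration), which is not reproduced here. The sketch below only illustrates a generic percentile-criterion (value-at-risk) Bellman backup over sampled posterior models at confidence level δ; the function name, shapes, and stopping rule are our own assumptions, not the authors' implementation.

```python
import numpy as np

def var_value_iteration(models, rewards, delta=0.05, gamma=0.95, n_iters=1000, tol=1e-6):
    """Percentile-criterion backup. models: (M, S, A, S); rewards: (S, A)."""
    _, n_states, n_actions, _ = models.shape
    v = np.zeros(n_states)
    for _ in range(n_iters):
        # Bellman target under every sampled model: shape (M, S, A).
        targets = rewards[None, :, :] + gamma * np.einsum("msan,n->msa", models, v)
        # delta-quantile over models: the value-at-risk of each (s, a) backup.
        q_var = np.quantile(targets, delta, axis=0)
        v_new = q_var.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    # Greedy policy with respect to the final VaR backup.
    targets = rewards[None, :, :] + gamma * np.einsum("msan,n->msa", models, v)
    policy = np.quantile(targets, delta, axis=0).argmax(axis=1)
    return v, policy

# Example usage (hypothetical shapes): v, pi = var_value_iteration(D2, R, delta=0.05)
```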