Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data
Authors: Allen Nie, Yannis Flet-Berliac, Deon Jordan, William Steenbergen, Emma Brunskill
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. |
| Researcher Affiliation | Academia | Allen Nie, Yannis Flet-Berliac, Deon R. Jordan, William Steenbergen, Emma Brunskill; Department of Computer Science, Stanford University; *anie@stanford.edu |
| Pseudocode | Yes | We propose a general pipeline: Split Select Retrain (SSR) (of which we provide a pseudo-code in Algorithm 1, Appendix A.4) |
| Open Source Code | Yes | 3. (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Supplementary Material. |
| Open Datasets | Yes | We conduct experiments on eight datasets (Figure 2) from five domains (details in Appendix A.14), which we give a short description below, and use as many as 540 candidate AH pairs for the Sepsis POMDP domain. ... D4RL (Fu et al., 2020) is an offline RL standardized benchmark designed and commonly used to evaluate the progress of offline RL algorithms. ... Robomimic (Mandlekar et al., 2021) is composed of various continuous control robotics environments with suboptimal human data. |
| Dataset Splits | Yes | The simplest method to train and verify an algorithm's performance without access to any simulator is to split the data into a train set Dtrain and a validation set Dvalid. ... First, we split and create different partitions of the input dataset. For each train/validation split, each algorithm-hyperparameter (AH) pair is trained on the training set and evaluated using the input OPE method to yield an estimated value on the validation set. ... In the proof and our experiments, we focus on when the training and validation sets are of equal size. (Minimal sketches of this split/select/retrain loop and of a WIS-style OPE estimator follow the table.) |
| Hardware Specification | No | The paper does not specify the exact hardware used for experiments (e.g., specific GPU models, CPU types, or detailed cloud instance configurations). While it indicates 'Yes' to question 3(d) in its self-assessment about compute resources, this information is not found within the provided paper text. |
| Software Dependencies | No | The paper mentions various algorithms and methods (e.g., WIS, FQE, BC, CQL) but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | We experiment with popular offline RL methods (see Table 2 and we provide algorithmic and hyperparameter details in Table A.2). ... Table A.2: Hyperparameters |
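
The Pseudocode and Dataset Splits rows above describe the paper's Split Select Retrain (SSR) procedure: partition the logged data, train every algorithm-hyperparameter (AH) pair on each training half, score the resulting policies with an off-policy evaluation (OPE) method on the held-out half, select the best-scoring AH pair, and retrain it on the full dataset. The sketch below is a minimal illustration of that loop under assumed interfaces; the `train` and `ope_estimate` callables and all names are hypothetical, not the authors' code (Algorithm 1, Appendix A.4).

```python
import numpy as np

def split_select_retrain(dataset, candidates, ope_estimate, n_splits=3, seed=0):
    """Hedged sketch of the Split Select Retrain (SSR) idea.

    `dataset` is a list of trajectories; `candidates` is a list of
    (algorithm, hyperparameters) pairs where `algorithm(**hyperparameters)`
    exposes a hypothetical `train(trajectories) -> policy` method;
    `ope_estimate(policy, trajectories)` is any OPE routine (e.g. FQE or WIS).
    """
    rng = np.random.default_rng(seed)
    scores = {i: [] for i in range(len(candidates))}

    # 1) Split: build several equal-size train/validation partitions.
    for _ in range(n_splits):
        perm = rng.permutation(len(dataset))
        half = len(dataset) // 2
        train = [dataset[i] for i in perm[:half]]
        valid = [dataset[i] for i in perm[half:]]

        # Train every algorithm-hyperparameter (AH) pair on the training half
        # and score the resulting policy with OPE on the held-out half.
        for i, (algo, hparams) in enumerate(candidates):
            policy = algo(**hparams).train(train)
            scores[i].append(ope_estimate(policy, valid))

    # 2) Select: pick the AH pair with the best average OPE estimate.
    best = max(scores, key=lambda i: np.mean(scores[i]))
    algo, hparams = candidates[best]

    # 3) Retrain: refit the selected AH pair on the full dataset for deployment.
    return algo(**hparams).train(dataset)
```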
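The selection step can plug in any OPE estimator; the paper names WIS and FQE among the methods used. The following is a minimal weighted importance sampling (WIS) sketch under an assumed trajectory format of (state, action, reward, behavior_prob) tuples; this interface is illustrative, not taken from the paper's code.

```python
import numpy as np

def wis_estimate(policy_prob, trajectories, gamma=0.99):
    """Hedged sketch of weighted (self-normalized) importance sampling.

    `policy_prob(state, action)` is assumed to return the evaluation
    policy's probability of the logged action.
    """
    weights, returns = [], []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (s, a, r, b_prob) in enumerate(traj):
            ratio *= policy_prob(s, a) / b_prob  # cumulative importance ratio
            ret += (gamma ** t) * r              # discounted return
        weights.append(ratio)
        returns.append(ret)
    weights = np.asarray(weights)
    # Self-normalized estimate: weighted average of trajectory returns.
    return float(np.dot(weights, returns) / weights.sum())
```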