Provably Efficient Offline Reinforcement Learning with Perturbed Data Sources
Authors: Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additional experimental results further corroborate the effectiveness of HetPEVI. |
| Researcher Affiliation | Academia | (1) Department of Electrical and Computer Engineering, University of Virginia; (2) Department of Mathematics, The Hong Kong University of Science and Technology; (3) Department of Electrical Engineering, The Pennsylvania State University. |
| Pseudocode | Yes | Algorithm 1: HetPEVI |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The data source MDPs are randomly generated through a set of independent Dirichlet distributions (Marchal & Arbel, 2017). |
| Dataset Splits | No | The paper describes its simulation setup and data generation process but does not specify training, validation, or test dataset splits. |
| Hardware Specification | No | The paper describes the simulation environment and parameters (e.g., S=2, A=20, H=20 for the target MDP) but does not specify any hardware details such as CPU, GPU, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions generating data using 'Dirichlet distributions (Marchal & Arbel, 2017)' but does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | The target MDP has $H = 20$ steps, $S = 2$ states (labelled $\{1, 2\}$), and $A = 20$ actions (labelled $\{1, 2, \ldots, 20\}$). For all $h \in [H]$, the reward is $r_h(s, a) = 0.9$ if $(s, a) = (1, 1)$ and $0.1$ otherwise; the transitions satisfy $P_h(1 \mid s, a) = 0.9$ if $(s, a) = (1, 1)$ and $0.5$ otherwise, with $P_h(2 \mid s, a) = 0.1$ if $(s, a) = (1, 1)$ and $0.5$ otherwise. The rewards of the data sources are independently sampled from Bernoulli distributions, i.e., $r_{h,l}(s, a) \sim \mathrm{Bernoulli}(r_h(s, a))$, while the transitions are independently generated with standard Dirichlet distributions (Marchal & Arbel, 2017). The behavior policy is shared by all data sources: at each $(s, h) \in \mathcal{S} \times [H]$, it selects action 1 with probability 0.2 and otherwise chooses uniformly at random among the other actions (see the sketch below the table). |
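The experiment-setup description is detailed enough to sketch the data-generation process. The following is a minimal, hypothetical reconstruction in Python/NumPy, not the authors' code: the number of data sources `n_sources`, the Dirichlet concentration `alpha = 1.0` (read from "standard Dirichlet"), and the RNG seed are illustrative assumptions not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)           # illustrative seed, not specified in the paper
H, S, A = 20, 2, 20                      # horizon, states {1, 2}, actions {1, ..., 20}

# Target MDP (0-indexed: state 1 -> index 0, action 1 -> index 0).
r = np.full((H, S, A), 0.1)
r[:, 0, 0] = 0.9                         # r_h(s, a) = 0.9 iff (s, a) = (1, 1)
P1 = np.full((H, S, A), 0.5)
P1[:, 0, 0] = 0.9                        # P_h(1 | s, a) = 0.9 iff (s, a) = (1, 1)
P = np.stack([P1, 1.0 - P1], axis=-1)    # P[h, s, a, s'], with P_h(2 | .) = 1 - P_h(1 | .)

# Perturbed data sources: Bernoulli rewards around r_h, Dirichlet-sampled transitions.
n_sources, alpha = 5, 1.0                # assumed values; the paper varies the number of sources
sources = []
for _ in range(n_sources):
    r_l = rng.binomial(1, r).astype(float)            # r_{h,l}(s, a) ~ Bernoulli(r_h(s, a))
    P_l = rng.dirichlet([alpha] * S, size=(H, S, A))  # independent standard Dirichlet transitions
    sources.append((r_l, P_l))

# Behavior policy shared by all sources: action 1 w.p. 0.2, else uniform over the other actions.
pi_b = np.full(A, 0.8 / (A - 1))
pi_b[0] = 0.2

def sample_action(rng=rng):
    """Draw one action index from the shared behavior policy."""
    return rng.choice(A, p=pi_b)
```

Since $S = 2$, storing only $P_h(1 \mid s, a)$ and deriving $P_h(2 \mid s, a) = 1 - P_h(1 \mid s, a)$ reproduces the tabular transition model described in the setup.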