Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning
Authors: Ali Mousavi, Lihong Li, Qiang Liu, Denny Zhou
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmarks verify the effectiveness of our approach. |
| Researcher Affiliation | Collaboration | Ali Mousavi, Lihong Li, Dengyong (Denny) Zhou: Google Research ({alimous,lihong,dennyzhou}@google.com); Qiang Liu: University of Texas at Austin (lqiang@cs.utexas.edu) |
| Pseudocode | Yes | Appendix E (Pseudo-code of Algorithm): "This section includes the pseudo-code of our algorithm that we described in Section 4." It presents Algorithm 1, the Black-box Off-policy Estimator based on MMD (see the illustrative MMD sketch after this table). |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We now focus on four classic control problems... Pendulum... Mountain Car... Cartpole... Acrobot. These are well-known, publicly available reinforcement learning environments. |
| Dataset Splits | No | The paper describes how samples are generated ("trajectories") and used for estimation, and mentions Monte-Carlo samples for reporting results, but does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions using a "3-layer (...) feed-forward neural network with the sigmoid activation function" but does not specify any software libraries, frameworks, or their version numbers that were used for implementation. |
| Experiment Setup | Yes | For each environment, we train a near-optimal policy π+ using the Neural Fitted Q Iteration algorithm (Riedmiller, 2005). We then set the behavior and target policies as πbeh = α1π+ + (1 − α1)π− and π = α2π+ + (1 − α2)π−, where π− denotes a random policy, and 0 ≤ α1, α2 ≤ 1 are two constant values making the behavior policy distinct from the target policy. In our experiments, we set α1 = 0.7 and α2 = 0.9. In all the cases, we use a 3-layer (having 30, 20, and 10 hidden neurons) feed-forward neural network with the sigmoid activation function as our parametric model in equation 8 (see the setup sketch after this table). |
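
The following is a minimal sketch of the reported experiment setup: a behavior/target policy built as a mixture of a near-optimal policy and a random policy (α1 = 0.7 for behavior, α2 = 0.9 for target), and a 3-layer (30, 20, 10 hidden units) feed-forward network with sigmoid activations. The callables `near_optimal_policy` and `random_policy`, the scalar linear output head, and the weight initialization are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

ALPHA_BEHAVIOR, ALPHA_TARGET = 0.7, 0.9  # mixing constants reported in the setup


def mixture_policy_action(state, near_optimal_policy, random_policy, alpha, rng):
    """Sample an action from the mixture alpha * pi_plus + (1 - alpha) * pi_random."""
    if rng.random() < alpha:
        return near_optimal_policy(state)
    return random_policy(state)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_mlp(input_dim, hidden=(30, 20, 10), rng=np.random.default_rng(0)):
    """Weights for a feed-forward net with sigmoid hidden layers of sizes 30, 20, 10."""
    dims = (input_dim, *hidden, 1)  # scalar output head is an assumption
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]


def mlp_forward(params, x):
    """Forward pass: sigmoid hidden layers, linear output (output head assumed)."""
    for W, b in params[:-1]:
        x = sigmoid(x @ W + b)
    W, b = params[-1]
    return x @ W + b
```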
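
The pseudocode row refers to Algorithm 1, a black-box off-policy estimator based on the maximum mean discrepancy (MMD). The snippet below is not the paper's algorithm; it only illustrates, assuming a Gaussian RBF kernel, how a squared MMD between a weighted sample set and a uniform one can be computed, which is the kind of discrepancy such an estimator minimizes over candidate weights.

```python
import numpy as np


def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def weighted_squared_mmd(X, w, Y, bandwidth=1.0):
    """Squared MMD between the w-weighted empirical distribution on X
    and the uniform empirical distribution on Y."""
    w = w / w.sum()                      # normalize weights to a distribution
    v = np.full(len(Y), 1.0 / len(Y))    # uniform weights on Y
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    return w @ Kxx @ w - 2.0 * (w @ Kxy @ v) + v @ Kyy @ v
```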