Off-Policy Evaluation via Off-Policy Classification
Authors: Alexander Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that this metric outperforms baselines on a number of tasks. |
| Researcher Affiliation | Collaboration | ¹Google Brain, Mountain View, USA; ²DeepMind, London, UK; ³University of California, Berkeley, USA |
| Pseudocode | Yes | Pseudocode is in Appendix B. |
| Open Source Code | No | The paper states "Code for the binary tree environment is available at https://bit.ly/2Qx6TJ7.", but does not explicitly state that code for the main methodology (OPC/SoftOPC) is open-source or provided (a hedged sketch of a SoftOPC-style scorer follows the table). |
| Open Datasets | No | The paper describes collecting its own datasets for the robotic grasping task ('data collected by a hand-crafted policy... with two different datasets') and for the Binary Tree and Pong experiments ('generated 1,000 episodes from a uniformly random policy', 'generated 30 episodes from each'), but does not provide concrete access information (link, DOI, or explicit statement of public availability) for these collected datasets. |
| Dataset Splits | Yes | The validation dataset D was collected by generating 1,000 episodes from a uniformly random policy (Binary Tree). For the validation dataset we used 38 Q-functions that were partially trained with DDQN and generated 30 episodes from each, for a total of 1,140 episodes (Pong). ... based on held-out validation sets of 50,000 episodes from the training environment and 10,000 episodes from the test one (Robotic Grasping). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions algorithms like DQN, DDQN, and QT-Opt, but does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks with their versions) used in the experiments. |
| Experiment Setup | Yes | We learned Q-functions using DQN [25] and DDQN [38], varying hyperparameters such as the learning rate, the discount factor γ, and the batch size, as discussed in detail in Appendix E.2. Appendix E.2 reports a learning rate of 0.0000625, a discount factor γ of 0.99, and a batch size of 32 (see the configuration sketch after this table). |
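The quoted hyperparameters above (Appendix E.2) are the only training details this report captures. Below is a minimal sketch of how those values might slot into a DDQN-style target computation; the function name, array shapes, and random example inputs are hypothetical illustrations, not the authors' released code.

```python
import numpy as np

# Hyperparameters quoted from Appendix E.2 of the paper; everything else
# in this sketch (names, shapes, example data) is a hypothetical illustration.
CONFIG = {
    "learning_rate": 6.25e-5,   # 0.0000625, as reported
    "discount_gamma": 0.99,     # discount factor γ
    "batch_size": 32,
}

def ddqn_targets(rewards, next_q_online, next_q_target, dones,
                 gamma=CONFIG["discount_gamma"]):
    """Double-DQN bootstrap targets for one minibatch.

    rewards, dones:              shape (batch,)
    next_q_online, next_q_target: shape (batch, num_actions)
    The online network selects the greedy action; the target network
    evaluates it (the standard DDQN decoupling).
    """
    greedy_actions = np.argmax(next_q_online, axis=1)
    bootstrap = next_q_target[np.arange(len(rewards)), greedy_actions]
    return rewards + gamma * (1.0 - dones) * bootstrap

# Tiny usage example with a batch of the reported size.
rng = np.random.default_rng(0)
b, n_actions = CONFIG["batch_size"], 4
targets = ddqn_targets(
    rewards=rng.normal(size=b),
    next_q_online=rng.normal(size=(b, n_actions)),
    next_q_target=rng.normal(size=(b, n_actions)),
    dones=rng.integers(0, 2, size=b).astype(float),
)
print(targets.shape)  # (32,)
```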
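For context on the metric named in the title, here is a minimal sketch of a SoftOPC-style scorer. It assumes SoftOPC compares the mean Q-value on (state, action) pairs labeled effective against the mean Q-value over all validation pairs; the function name, labels, and example data are hypothetical, and the authors' own pseudocode is the Appendix B version noted in the table.

```python
import numpy as np

def soft_opc(q_values: np.ndarray, effective: np.ndarray) -> float:
    """Hedged sketch of a SoftOPC-style score, under the assumption stated above.

    q_values:  Q(s, a) for each validation (s, a) pair, shape (n,).
    effective: boolean mask marking pairs labeled effective, shape (n,).
    """
    return float(q_values[effective].mean() - q_values.mean())

# Usage: a higher score suggests the Q-function better separates
# effective from catastrophic (state, action) pairs.
q = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([True, True, False, False])
print(soft_opc(q, labels))  # 0.35
```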