From Importance Sampling to Doubly Robust Policy Gradient
Authors: Jiawei Huang, Nan Jiang
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we empirically validate the effectiveness of DR-PG. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Illinois Urbana-Champaign. Correspondence to: Nan Jiang <nanjiang@illinois.edu>. |
| Pseudocode | No | The paper contains mathematical derivations and proofs but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use the Cart Pole Environment in Open AI Gym (Brockman et al., 2016) with DART physics engine (Lee et al., 2018), and set the horizon length to be 1000. |
| Dataset Splits | No | The paper mentions training an agent but does not provide specific details on dataset splits (percentages, counts, or explicit splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper mentions 'CPU/GPU usage' in its computational cost section but does not specify any particular hardware models (e.g., specific GPU or CPU models) used for the experiments. |
| Software Dependencies | No | The paper mentions 'Open AI Gym' and 'DART physics engine' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We follow Cheng et al. (2019) for the choice of neural network architecture and training methods in building the policy π, value function estimator Ṽ, and dynamics model d̃. We choose n_q = 20, n_v = 20, L = 30 and γ = 0.9 in the actual experiments. For each state s_t we use Monte Carlo (1000 samples) to compute the expectation (over a_t) of Q(s_t, a_t) log π(s_t, a_t). |
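The Experiment Setup row quotes the paper's Monte Carlo step: for each state s_t, averaging Q(s_t, a_t) log π(s_t, a_t) over 1000 actions sampled from the policy. The sketch below is a hypothetical illustration of that averaging step only, not the authors' code; the softmax policy, the placeholder Q function, and all names (`softmax_policy`, `mc_expectation`, `q_fn`) are assumptions for the sake of a runnable example.

```python
# Hypothetical sketch (not the authors' code): Monte Carlo estimate of
# E_{a ~ pi(.|s)}[Q(s, a) * log pi(a|s)] using 1000 sampled actions,
# mirroring the step described in the Experiment Setup row above.
import numpy as np

def softmax_policy(theta, s):
    """Toy softmax policy over a discrete action set; returns pi(.|s)."""
    logits = theta @ s                      # theta: (num_actions, state_dim)
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def mc_expectation(theta, s, q_fn, num_samples=1000, rng=None):
    """Monte Carlo estimate of E_{a ~ pi(.|s)}[Q(s, a) * log pi(a|s)]."""
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax_policy(theta, s)
    actions = rng.choice(len(probs), size=num_samples, p=probs)
    values = np.array([q_fn(s, a) * np.log(probs[a]) for a in actions])
    return values.mean()

if __name__ == "__main__":
    # Toy usage with a made-up Q function and a 4-dim CartPole-like state.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=(2, 4))         # 2 actions, 4 state features
    s = np.array([0.0, 0.1, -0.05, 0.2])    # placeholder state
    q_fn = lambda s, a: float(a) - s.sum()  # placeholder Q estimate
    print(mc_expectation(theta, s, q_fn, num_samples=1000, rng=rng))
```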