Direct Policy Iteration with Demonstrations
Authors: Jessica Chemali, Alessandro Lazaric
IJCAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we report an empirical evaluation of the algorithm and a comparison with the state-of-the-art algorithms. |
| Researcher Affiliation | Academia | Jessica Chemali, Machine Learning Department, Carnegie Mellon University; Alessandro Lazaric, SequeL team, INRIA Lille |
| Pseudocode | Yes | Algorithm 1 Direct Policy Iteration with Demonstrations (a hedged sketch of such an update loop appears after the table) |
| Open Source Code | Yes | The implementation of DPID is available at https://www.dropbox.com/s/jj4g9ndonol4aoy/dpid.zip?dl=0 |
| Open Datasets | No | The paper uses the Garnet framework to generate random finite MDPs and describes the Vehicle Brake Control domain; both are simulated environments, and the paper does not provide access to the specific datasets used in the experiments. |
| Dataset Splits | No | The paper does not explicitly provide details on how the data was split into training, validation, and test sets for the experiments. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., CPU, GPU models, or memory specifications). |
| Software Dependencies | No | The paper mentions software like CVX and LIBSVM but does not provide specific version numbers for these or any other ancillary software dependencies used in the experiments. |
| Experiment Setup | Yes | We use Ns = 15, Na = 3, Nb(s) ∈ [3, 6], and we estimate Err_alg over 100 independent runs, while error bars are computed as 95% Gaussian confidence intervals. We fix NE = 15 optimal expert demonstrations and increase NRL by 50 at each iteration, starting with NRL = 50 − NE (NE = 0 for DPI). In Fig. 1B and 1C, we replace the optimal expert with a suboptimal one by sampling random non-optimal actions 25% and 50% of the time, respectively. We then introduce an approximate representation such that for every state we construct a binary feature vector of length d = 6 < Ns. The number of ones in the representation is set to l = 3 and their locations are chosen randomly as in [Bhatnagar et al., 2009]. (A sketch of this Garnet/feature construction follows the table.) |
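
The experiment-setup row above fully determines the random MDP and the feature construction. Below is a minimal Python sketch of that construction, assuming a standard Garnet-style generator: the uniform random reward placement and the per-state (rather than per-state-action) branching factor are our assumptions, not details reported in the table.

```python
import numpy as np

def make_garnet(ns=15, na=3, branch_range=(3, 6), rng=None):
    """Random finite MDP in the Garnet style: for each state s, the
    next-state distribution of every action is supported on Nb(s) states,
    with Nb(s) drawn uniformly from [3, 6]. Reward placement below is an
    assumption; the paper's quoted setup does not specify it."""
    rng = rng or np.random.default_rng(0)
    P = np.zeros((ns, na, ns))            # transition kernel P[s, a, s']
    for s in range(ns):
        nb = rng.integers(branch_range[0], branch_range[1] + 1)
        for a in range(na):
            succ = rng.choice(ns, size=nb, replace=False)
            P[s, a, succ] = rng.dirichlet(np.ones(nb))
    R = rng.uniform(size=(ns, na))        # assumed uniform random rewards
    return P, R

def binary_features(ns=15, d=6, l=3, rng=None):
    """Per-state binary feature vector of length d = 6 < Ns with exactly
    l = 3 ones, their locations chosen at random, following the
    [Bhatnagar et al., 2009] recipe the paper cites."""
    rng = rng or np.random.default_rng(1)
    phi = np.zeros((ns, d))
    for s in range(ns):
        phi[s, rng.choice(d, size=l, replace=False)] = 1.0
    return phi
```

Under these assumptions, `P, R = make_garnet()` and `phi = binary_features()` reproduce the reported dimensions (Ns = 15, Na = 3, d = 6, l = 3); averaging any error metric over 100 such draws matches the quoted evaluation protocol.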
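
Algorithm 1 itself is not reproduced in this report. The following is a tabular caricature of a DPID-style update, assuming the algorithm trades a rollout-based regret loss on RL states against an expert-disagreement penalty on demonstrated states via a mixing weight `alpha`; the function names, the single-rollout Q estimate, and the exact loss mixing are illustrative assumptions, not the paper's classifier-based formulation.

```python
import numpy as np

def rollout_q(P, R, pi, s, a, horizon=20, gamma=0.95, rng=None):
    """Single-rollout Monte-Carlo estimate of Q^pi(s, a) in a tabular MDP
    (P, R as produced by make_garnet above); horizon and gamma are
    illustrative defaults."""
    rng = rng or np.random.default_rng()
    ret, disc = 0.0, 1.0
    for t in range(horizon):
        act = a if t == 0 else pi[s]
        ret += disc * R[s, act]
        disc *= gamma
        s = rng.choice(P.shape[0], p=P[s, act])
    return ret

def dpid_step(P, R, pi, rl_states, demos, alpha=0.5, rng=None):
    """One hypothetical DPID-style policy update: on rollout states pick
    the empirically greedy action; on demonstrated states, add a penalty
    for disagreeing with the expert, weighted by alpha."""
    na = P.shape[1]
    new_pi = pi.copy()
    for s in rl_states:
        q = np.array([rollout_q(P, R, pi, s, a, rng=rng) for a in range(na)])
        loss = q.max() - q                       # rollout-based regret loss
        if s in demos:                           # expert disagreement penalty
            loss += alpha * (np.arange(na) != demos[s])
        new_pi[s] = int(np.argmin(loss))
    return new_pi
```

A usage sketch under the same assumptions: start from `pi = np.zeros(15, dtype=int)` and iterate `pi = dpid_step(P, R, pi, rl_states=range(15), demos={0: 1, 3: 2})`. The demonstration dictionary here is arbitrary; in the paper the expert actions come from an optimal policy, or from a suboptimal one that acts randomly 25% or 50% of the time.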