Direct Policy Iteration with Demonstrations

Authors: Jessica Chemali, Alessandro Lazaric

IJCAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we report an empirical evaluation of the algorithm and a comparison with the state-of-the-art algorithms.
Researcher Affiliation | Academia | Jessica Chemali (Machine Learning Department, Carnegie Mellon University); Alessandro Lazaric (SequeL team, INRIA Lille)
Pseudocode | Yes | Algorithm 1 Direct Policy Iteration with Demonstrations
Open Source Code | Yes | The implementation of DPID is available at https://www.dropbox.com/s/jj4g9ndonol4aoy/dpid.zip?dl=0
Open Datasets | No | The paper uses the Garnet framework to generate random finite MDPs and describes the Vehicle Brake Control domain. Both are simulated environments, and the paper does not provide access to the specific datasets used in the experiments.
Dataset Splits | No | The paper does not explicitly describe how the data was split into training, validation, and test sets for the experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., CPU or GPU models, or memory specifications).
Software Dependencies | No | The paper mentions software such as CVX and LIBSVM but does not provide version numbers for these or for any other ancillary software dependencies used in the experiments.
Experiment Setup | Yes | We use Ns = 15, Na = 3, Nb(s) ∈ [3, 6] and we estimate Err_alg over 100 independent runs, while error bars are computed as 95% Gaussian confidence intervals. Fixed NE = 15 optimal expert demonstrations and increasing NRL by 50 at each iteration, starting with NRL = 50 - NE (NE = 0 for DPI). In Fig. 1B and C, we replace the optimal expert with a suboptimal one by sampling random non-optimal actions 25% and 50% of the time respectively. We now introduce an approximate representation such that for every state, we construct a binary feature vector of length d = 6 < Ns. The number of ones in the representation is set to l = 3 and their locations are chosen randomly as in [Bhatnagar et al., 2009].
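
The experiment setup quoted above can be illustrated with a minimal sketch. This is not the authors' implementation (that is the DPID archive linked in the table); it only shows one plausible way to generate a Garnet-style transition model with Ns = 15, Na = 3 and a branching factor in [3, 6], the random binary features with d = 6 and l = 3, a suboptimal expert that picks a non-optimal action at a given rate, and a 95% Gaussian confidence interval. All function names are hypothetical, and several details are assumptions: the branching range is treated as inclusive, successors are drawn separately for each action, transition probabilities come from a flat Dirichlet distribution, and rewards are left out.

```python
import numpy as np


def make_garnet(n_states=15, n_actions=3, nb_range=(3, 6), rng=None):
    """Return a random transition tensor P[s, a, s'] in the Garnet style."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        # Branching factor Nb(s), assumed uniform on the inclusive range.
        nb = rng.integers(nb_range[0], nb_range[1] + 1)
        for a in range(n_actions):
            # Assumption: each action gets its own set of nb successors.
            successors = rng.choice(n_states, size=nb, replace=False)
            # Assumption: probabilities drawn from a flat Dirichlet.
            P[s, a, successors] = rng.dirichlet(np.ones(nb))
    return P


def binary_features(n_states=15, d=6, n_ones=3, rng=None):
    """One binary vector of length d per state, with n_ones random ones."""
    rng = np.random.default_rng() if rng is None else rng
    Phi = np.zeros((n_states, d), dtype=int)
    for s in range(n_states):
        Phi[s, rng.choice(d, size=n_ones, replace=False)] = 1
    return Phi


def noisy_expert_action(optimal_action, n_actions=3, error_rate=0.25, rng=None):
    """With probability error_rate, return a random non-optimal action."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < error_rate:
        others = [a for a in range(n_actions) if a != optimal_action]
        return int(rng.choice(others))
    return int(optimal_action)


def gaussian_ci95(samples):
    """Mean and half-width of a 95% Gaussian confidence interval."""
    samples = np.asarray(samples, dtype=float)
    half_width = 1.96 * samples.std(ddof=1) / np.sqrt(len(samples))
    return samples.mean(), half_width
```

Under these assumptions, calling `make_garnet(rng=np.random.default_rng(0))` once per run and passing the 100 resulting error estimates to `gaussian_ci95` would mirror the reported protocol. Setting `error_rate` to 0.25 or 0.5 corresponds to the two suboptimal-expert settings mentioned for Fig. 1B and C.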