Compressed Conditional Mean Embeddings for Model-Based Reinforcement Learning

Authors: Guy Lever, John Shawe-Taylor, Ronnie Stafford, Csaba Szepesvári

AAAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed 3 online model-based policy iteration experiments. We report mean results over 10 experiments, and standard error.
Researcher Affiliation | Academia | Guy Lever, University College London, London, UK; John Shawe-Taylor, University College London, London, UK; Ronnie Stafford, University College London, London, UK; Csaba Szepesvári, University of Alberta, Edmonton, Canada
Pseudocode | Yes (a minimal loop sketch appears after the table) | Algorithm 1: Generic model-based policy iteration with CMEs; Algorithm 3: Policy iteration with compressed kernel CMEs
Open Source Code | Yes | Lever, G.; Shawe-Taylor, J.; Stafford, R.; and Szepesvári, C. 2016. Compressed conditional mean embeddings for model-based reinforcement learning (supplementary material). http://www0.cs.ucl.ac.uk/staff/G.Lever/pubs/CME4RLSupp.pdf
Open Datasets | No | The paper describes experiments on benchmark MDPs (cart-pole, mountain-car) and a simulated quadrocopter navigation task using a cited simulator (De Nardi 2013). Data is generated by interaction with these environments rather than using a pre-existing publicly available dataset, and no access information for the collected data is provided.
Dataset Splits | Yes (a selection sketch appears after the table) | For all 3 methods we performed 5-fold cross-validation over 10 bandwidth parameters to optimize the input kernel $K$ on $S \times A$.
Hardware Specification | No | Experiments are run on a cluster of single core processors. This statement is too general and does not provide specific model numbers or detailed specifications.
Software Dependencies | No | The paper mentions using a 'simulator (De Nardi 2013)' and algorithms like 'Lasso (Tibshirani 1996)' but does not specify any software names with version numbers for reproducibility.
Experiment Setup | Yes (the kernels and the compression rule are sketched after the table) | The horizon of each MDP is 100, so that $n_{\mathrm{new}} = 200$ data points were added at each iteration. To perform planning at each iteration we performed $J = 10$ policy evaluation/improvement steps, before returning to the MDP to collect more data. For the compressed CME the size of the sparse-greedy feature space was constrained to be no greater than $d = 200$. For all 3 methods we performed 5-fold cross-validation over 10 bandwidth parameters to optimize the input kernel $K$ on $S \times A$. For the two least-squares methods we also cross-validated the regularization parameter over 20 values. The output feature map corresponds to a Gaussian kernel $\varphi(s) = L(s, \cdot)$, and the bandwidth of $L$ was chosen using an informal search for each MDP (it is not clear how this parameter can be validated). For planning we set $\gamma = 0.98$, but we report results for $\gamma = 0.99$. The state kernel is a Gaussian $L(s, s') = \exp\!\left(-\tfrac{1}{2\sigma_S^2}(s - s')^{\top} M_S (s - s')\right)$ with $M_S = \mathrm{diag}(1, 1/4)$, and the state-action kernel is $K((s,a),(s',a')) = \exp\!\left(-\tfrac{1}{2\sigma_{S \times A}^2}((s,a) - (s',a'))^{\top} M_{S \times A} ((s,a) - (s',a'))\right)$ with $M_{S \times A} = \mathrm{diag}(1, 1/4, 1/10000)$. The output feature map is $\varphi(s) = L(s, \cdot)$ with $\sigma_S = 0.5$ (chosen by informal search). During model learning we performed 5-fold cross-validation over a range of 10 bandwidths in $[0.01, 5]$ to optimize $\sigma_{S \times A}$. For the compressed CME the tolerance of the compression set was set to $\delta = 0.1$, i.e. we use a $\delta$-lossy compression set $C$ with $\delta = 0.1$.
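
The Pseudocode row points to Algorithm 1 (generic model-based policy iteration with CMEs) and Algorithm 3 (the compressed variant). The sketch below is not the authors' algorithm; it is a minimal, hypothetical illustration of the same loop structure on an invented 1-D MDP: collect a rollout with the current policy, refit a transition model (plain kernel ridge regression from (s, a) to s' standing in for a conditional mean embedding), then run J policy evaluation/improvement sweeps against the learned model. Only the horizon (100), J = 10, and γ = 0.98 are taken from the paper; the dynamics, constants, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTIONS = (-1.0, 1.0)
GAMMA, HORIZON, J_STEPS = 0.98, 100, 10       # gamma, horizon, J as reported in the paper

def env_step(s, a):
    """Invented 1-D dynamics; reward favours staying near the origin."""
    s_next = float(np.clip(0.9 * s + 0.1 * a + 0.05 * rng.normal(), -1.0, 1.0))
    return s_next, -abs(s_next)

def k_gauss(X, Y, sigma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_model(SA, S_next, sigma, lam=1e-3):
    """Kernel ridge regression (s, a) -> s': a finite-sample stand-in for a CME."""
    K = k_gauss(SA, SA, sigma)
    W = np.linalg.solve(K + lam * np.eye(len(SA)), S_next)
    return SA, W, sigma

def predict_next(s, a, model):
    SA, W, sigma = model
    return (k_gauss(np.array([[s, a]]), SA, sigma) @ W).item()

S_GRID = np.linspace(-1.0, 1.0, 21)           # states on which V is represented

def greedy_backup(s, V, model):
    """One-step lookahead through the learned model for each action."""
    q = []
    for a in ACTIONS:
        s_pred = predict_next(s, a, model)
        q.append(-abs(s_pred) + GAMMA * np.interp(s_pred, S_GRID, V))
    return max(q), ACTIONS[int(np.argmax(q))]

# Generic loop: interact with the MDP, refit the model, plan with it, repeat.
data_sa, data_next, V = [], [], np.zeros_like(S_GRID)
policy = lambda s: rng.choice(ACTIONS)        # initial random policy

for it in range(5):
    s = 0.0
    for _ in range(HORIZON):                  # 1) collect a rollout in the MDP
        a = policy(s)
        s_next, _ = env_step(s, a)
        data_sa.append([s, a])
        data_next.append([s_next])
        s = s_next
    model = fit_model(np.array(data_sa), np.array(data_next), sigma=0.3)
    for _ in range(J_STEPS):                  # 2) J evaluation/improvement sweeps
        V = np.array([greedy_backup(s0, V, model)[0] for s0 in S_GRID])
    policy = lambda s, m=model, V=V: greedy_backup(s, V, m)[1]
    print(f"iteration {it}: mean value over the state grid = {V.mean():.3f}")
```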
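
The Dataset Splits row describes 5-fold cross-validation over 10 candidate bandwidths for the input kernel on $S \times A$. Below is a minimal sketch of that model-selection step on synthetic transition data; the 5 folds, the 10 candidates, and the $[0.01, 5]$ range follow the paper, while the regression fitted inside each fold (kernel ridge regression), the data, and the fixed regularization value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def k_gauss(X, Y, sigma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical transition data: rows are (s, a), targets are next states s'.
SA = rng.uniform(-1, 1, size=(200, 2))
S_next = 0.9 * SA[:, :1] + 0.1 * SA[:, 1:] + 0.05 * rng.normal(size=(200, 1))

bandwidths = np.logspace(np.log10(0.01), np.log10(5.0), 10)  # 10 candidates in [0.01, 5]
lam = 1e-3                                                   # assumed regularization
folds = np.array_split(rng.permutation(len(SA)), 5)          # 5 folds

def cv_error(sigma):
    """Mean held-out squared error of kernel ridge regression at bandwidth sigma."""
    err = 0.0
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        K = k_gauss(SA[train], SA[train], sigma)
        W = np.linalg.solve(K + lam * np.eye(len(train)), S_next[train])
        pred = k_gauss(SA[test], SA[train], sigma) @ W
        err += np.mean((pred - S_next[test]) ** 2)
    return err / 5

best = min(bandwidths, key=cv_error)
print(f"selected bandwidth sigma = {best:.3f}")
```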
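
The Experiment Setup row reports weighted Gaussian kernels with $M_S = \mathrm{diag}(1, 1/4)$ and $M_{S \times A} = \mathrm{diag}(1, 1/4, 1/10000)$, a compression tolerance $\delta = 0.1$, and a dictionary cap $d = 200$. The sketch below shows such a kernel and one plausible sparse-greedy way to build a compressed point set under those two constraints; the authors' actual $\delta$-lossy compression construction may differ, and the sample points, the $\sigma_{S \times A}$ value used in the demo, and the function names are hypothetical.

```python
import numpy as np

def gauss_kernel(x, y, M, sigma):
    """Weighted Gaussian kernel exp(-(x - y)^T diag(M) (x - y) / (2 sigma^2))."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.exp(-(d @ (M * d)) / (2.0 * sigma ** 2)))

# Diagonal metrics reported in the paper (2-D state, 1-D action).
M_S = np.array([1.0, 0.25])                    # state kernel L
M_SA = np.array([1.0, 0.25, 1.0 / 10000.0])    # state-action kernel K
SIGMA_S = 0.5                                  # sigma_S chosen by informal search in the paper

def build_compression_set(points, M, sigma, delta=0.1, d_max=200):
    """Greedy dictionary: add a point only if its kernel feature is not already
    approximated within tolerance delta by the current dictionary (an
    ALD/sparse-greedy style rule; the paper's delta-lossy construction may differ).
    Points whose residual exceeds delta once the cap d_max is hit are dropped."""
    dictionary = []
    for x in points:
        if not dictionary:
            dictionary.append(x)
            continue
        K_dd = np.array([[gauss_kernel(u, v, M, sigma) for v in dictionary]
                         for u in dictionary])
        k_dx = np.array([gauss_kernel(u, x, M, sigma) for u in dictionary])
        # Squared residual of projecting phi(x) onto span{phi(u): u in dictionary}.
        coeffs = np.linalg.solve(K_dd + 1e-10 * np.eye(len(dictionary)), k_dx)
        residual = gauss_kernel(x, x, M, sigma) - k_dx @ coeffs
        if residual > delta and len(dictionary) < d_max:
            dictionary.append(x)
    return dictionary

rng = np.random.default_rng(2)
sa_points = rng.uniform(-1, 1, size=(300, 3))          # hypothetical (s, a) samples
C = build_compression_set(sa_points, M_SA, sigma=0.5, delta=0.1)  # sigma here is assumed
print(f"compression set size: {len(C)} of {len(sa_points)} points")
```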