Compressed Conditional Mean Embeddings for Model-Based Reinforcement Learning
Authors: Guy Lever, John Shawe-Taylor, Ronnie Stafford, Csaba Szepesvári
AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed 3 online model-based policy iteration experiments. We report mean results over 10 experiments, and standard error. |
| Researcher Affiliation | Academia | Guy Lever, University College London, London, UK; John Shawe-Taylor, University College London, London, UK; Ronnie Stafford, University College London, London, UK; Csaba Szepesvári, University of Alberta, Edmonton, Canada |
| Pseudocode | Yes | Algorithm 1 Generic model-based policy iteration with CMEs; Algorithm 3 Policy iteration with compressed kernel CMEs |
| Open Source Code | Yes | Lever, G.; Shawe-Taylor, J.; Stafford, R.; and Szepesvári, C. 2016. Compressed conditional mean embeddings for model-based reinforcement learning (supplementary material). http://www0.cs.ucl.ac.uk/staff/G.Lever/pubs/CME4RLSupp.pdf |
| Open Datasets | No | The paper describes experiments on benchmark MDPs (cart-pole, mountain-car) and a simulated Quadrocopter navigation task using a cited simulator (De Nardi 2013). Data is generated by interaction with these environments rather than using a pre-existing publicly available dataset, and no access information for collected data is provided. |
| Dataset Splits | Yes | For all 3 methods we performed 5-fold cross-validation over 10 bandwidth parameters to optimize the input kernel K on S × A. |
| Hardware Specification | No | Experiments are run on a cluster of single core processors. This statement is too general and does not provide specific model numbers or detailed specifications. |
| Software Dependencies | No | The paper mentions using a 'simulator (De Nardi 2013)' and algorithms like 'Lasso (Tibshirani 1996)' but does not specify any software names with version numbers for reproducibility. |
| Experiment Setup | Yes | The horizon of each MDP is 100, so that n_new = 200 data points were added at each iteration. To perform planning at each iteration we performed J = 10 policy evaluation/improvement steps, before returning to the MDP to collect more data. For the compressed CME the size of the sparse-greedy feature space was constrained to be no greater than d = 200. For all 3 methods we performed 5-fold cross-validation over 10 bandwidth parameters to optimize the input kernel K on S × A. For the two least-squares methods we also cross-validated the regularization parameter over 20 values. The output feature map corresponds to a Gaussian kernel φ(s) = L(s, ·), and the bandwidth of L was chosen using an informal search for each MDP (it is not clear how this parameter can be validated). For planning we set γ = 0.98, but we report results for γ = 0.99. The state kernel is a Gaussian L(s, s') = exp(−(1/(2σ_S²)) (s − s')ᵀ M_S (s − s')) with M_S = diag(1, 1/4), and the state-action kernel is K((s, a), (s', a')) = exp(−(1/(2σ_{S×A}²)) ((s, a) − (s', a'))ᵀ M_{S×A} ((s, a) − (s', a'))) with M_{S×A} = diag(1, 1/4, 1/10000). The output feature map is φ(s) = L(s, ·) with σ_S = 0.5 (chosen by informal search). During model learning we performed 5-fold cross-validation over a range of 10 bandwidths in the range [0.01, 5] to optimize σ_{S×A}. For the compressed CME the tolerance of the compression set was set to δ = 0.1, i.e. we use a δ-lossy compression set C with δ = 0.1. |
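
The Experiment Setup row spells out the two Gaussian kernels and their metric matrices. Below is a minimal sketch of those definitions only (NumPy assumed; the function names `state_kernel` and `state_action_kernel` are our own labels, not the authors'):

```python
import numpy as np

# Metric matrices quoted in the Experiment Setup row.
M_S = np.diag([1.0, 1.0 / 4.0])                   # state metric M_S
M_SA = np.diag([1.0, 1.0 / 4.0, 1.0 / 10000.0])   # state-action metric M_{S×A}

def state_kernel(s, s_prime, sigma_S=0.5):
    """Gaussian state kernel L(s, s') = exp(-(s - s')^T M_S (s - s') / (2 sigma_S^2)).
    sigma_S = 0.5 is the value quoted as chosen by informal search."""
    d = np.asarray(s, dtype=float) - np.asarray(s_prime, dtype=float)
    return np.exp(-(d @ M_S @ d) / (2.0 * sigma_S ** 2))

def state_action_kernel(sa, sa_prime, sigma_SA):
    """Gaussian state-action kernel K((s, a), (s', a')) with metric M_{S×A};
    sigma_SA is the bandwidth selected by cross-validation (see below)."""
    d = np.asarray(sa, dtype=float) - np.asarray(sa_prime, dtype=float)
    return np.exp(-(d @ M_SA @ d) / (2.0 * sigma_SA ** 2))
```

The output feature map φ(s) = L(s, ·) is then simply the partial application s' ↦ state_kernel(s, s').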
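The Dataset Splits and Experiment Setup rows both describe 5-fold cross-validation over 10 bandwidths in [0.01, 5]. A sketch of that selection loop follows, assuming scikit-learn's `KFold`; the scoring callable `cme_validation_error` is a hypothetical placeholder, since the paper's exact model-selection objective is not quoted here:

```python
import numpy as np
from sklearn.model_selection import KFold

def select_bandwidth(X, Y, cme_validation_error, n_folds=5):
    """Choose the state-action bandwidth sigma_{S×A} by 5-fold cross-validation
    over 10 candidate values in [0.01, 5], mirroring the quoted setup.
    `cme_validation_error` is a hypothetical callable that fits a CME on the
    training fold and returns its error on the validation fold."""
    bandwidths = np.linspace(0.01, 5.0, 10)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    mean_errors = []
    for sigma in bandwidths:
        errors = [
            cme_validation_error(sigma, X[train], Y[train], X[val], Y[val])
            for train, val in folds.split(X)
        ]
        mean_errors.append(np.mean(errors))
    return bandwidths[int(np.argmin(mean_errors))]
```

The quoted setup also cross-validates a regularization parameter over 20 values for the two least-squares methods; the same loop applies with a second grid.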
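Finally, the Pseudocode and Experiment Setup rows together give the shape of the online loop (horizon 100, n_new = 200 transitions per iteration, J = 10 evaluation/improvement steps per planning phase). The skeleton below is a structural sketch only; every helper passed in (`collect_transitions`, `fit_cme`, `evaluate_policy`, `improve_policy`) is a hypothetical placeholder, not the authors' Algorithm 1:

```python
def model_based_policy_iteration(env, policy, n_iterations,
                                 collect_transitions, fit_cme,
                                 evaluate_policy, improve_policy,
                                 n_new=200, J=10, gamma=0.98):
    """Skeleton of the loop quoted above: gather n_new transitions, refit the
    (compressed) CME model of the dynamics, then run J policy
    evaluation/improvement steps before returning to the MDP for more data."""
    data = []
    for _ in range(n_iterations):
        data.extend(collect_transitions(env, policy, n_new))  # e.g. 2 episodes of horizon 100
        model = fit_cme(data)                                 # conditional mean embedding of the dynamics
        for _ in range(J):                                    # planning in the learned model
            value = evaluate_policy(model, policy, gamma)
            policy = improve_policy(model, value)
    return policy
```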