Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes
Authors: Taylor W. Killian, Samuel Daulton, George Konidaris, Finale Doshi-Velez
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments and Results. For each of these domains, we compare our formulation of the HiP-MDP with embedded latent parameters (equation 2) with four baselines (one model-free and three model-based) to demonstrate the efficiency of learning a policy for a new instance b using the HiP-MDP. These comparisons are made across the first handful of episodes encountered in a new task instance to highlight the advantage provided by transferring information through the HiP-MDP. The HiP-MDP with embedded w_b outperforms all four benchmarks. |
| Researcher Affiliation | Collaboration | Taylor Killian (taylorkillian@g.harvard.edu), Harvard University; Samuel Daulton (sdaulton@g.harvard.edu), Harvard University, Facebook; George Konidaris (gdk@cs.brown.edu), Brown University; Finale Doshi-Velez (finale@seas.harvard.edu), Harvard University |
| Pseudocode | Yes | Algorithm 1: Learning a control policy w/ the HiP-MDP (a schematic sketch of this episodic training loop appears below the table) |
| Open Source Code | Yes | Example code for training and evaluating a HiP-MDP, including the simulators used in this section, can be found at http://github.com/dtak/hip-mdp-public. |
| Open Datasets | Yes | We revisit the 2D demonstration problem from Section 3, as well as describe results on both the acrobot [42] and a more complex healthcare domain: prescribing effective HIV treatments [15] to patients with varying physiologies. (Acrobot [42] refers to R. Sutton and A. Barto, "Reinforcement Learning: An Introduction," MIT Press, Cambridge, 1998; HIV [15] refers to D. Ernst, G. Stan, J. Goncalves, and L. Wehenkel, "Clinical data based optimal STI strategies for HIV: a reinforcement learning approach," Proceedings of the 45th IEEE Conference on Decision and Control, 2006. Both are standard academic benchmark problems.) |
| Dataset Splits | No | The paper describes the training process (e.g., "trained on observations from a single episode", use of "global replay buffer D" and "instance-specific replay buffer Db") and update procedures, but it does not specify explicit numerical percentages or counts for training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not specify any hardware details, such as CPU models, GPU types, or memory specifications, used for running the experiments. |
| Software Dependencies | No | The paper mentions software components and algorithms such as Bayesian Neural Networks (BNNs), the Adam optimizer, and Double Deep Q Networks (DDQNs; the standard DDQN target is sketched below the table), but it does not provide version numbers for these or for any other software libraries or programming languages used. |
| Experiment Setup | Yes | Specific modeling details, such as the number of epochs and learning rates, are described in Appendix C. |
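
To make the quoted algorithm and replay-buffer structure concrete, here is a minimal, hypothetical sketch of the per-instance training loop described above (Algorithm 1, the global replay buffer D, and the instance-specific buffer D_b). The `ToyInstance` environment, the random placeholder policy, and all names below are illustrative assumptions, not the authors' implementation; their actual code is at http://github.com/dtak/hip-mdp-public.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyInstance:
    """Stand-in task instance b whose dynamics depend on a hidden parameter.

    Assumption for illustration only; the paper's domains are the 2D
    navigation task, acrobot, and the HIV treatment simulator.
    """
    def __init__(self, state_dim=2):
        self.theta = rng.normal(size=state_dim)   # hidden parameter (never observed)
        self.state = np.zeros(state_dim)

    def reset(self):
        self.state = np.zeros_like(self.state)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + 0.1 * self.theta * (action - 0.5)
        return self.state.copy(), -float(np.linalg.norm(self.state))

def run_new_instance(env, n_episodes=3, horizon=10, latent_dim=2):
    """Episodic loop for a new instance b, mirroring the structure of Algorithm 1."""
    w_b = rng.normal(scale=0.1, size=latent_dim)  # latent embedding for instance b
    D, D_b = [], []                               # global and instance-specific buffers
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):
            a = int(rng.integers(2))              # placeholder for the DDQN policy
            s2, r = env.step(a)
            D.append((s, a, r, s2, w_b.copy()))   # transitions tagged with w_b
            D_b.append((s, a, r, s2))
            s = s2
        # After each episode the paper refits w_b (and periodically the BNN
        # transition model) to D_b; that optimization step is omitted here.
    return w_b, D, D_b

w_b, D, D_b = run_new_instance(ToyInstance())
print(len(D), len(D_b))   # 30 30
```

The structural point is that transitions are tagged with the current latent embedding w_b, so a shared BNN dynamics model can be trained across instances while only w_b adapts to the new one.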
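
For the DDQN component mentioned in the Software Dependencies row, the sketch below computes the standard Double DQN target (van Hasselt et al.); this is the generic algorithm rather than code from the paper, and the array names are assumptions.

```python
import numpy as np

def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Compute y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    Selecting the greedy action with the online network but evaluating it with
    the target network is what distinguishes Double DQN from vanilla DQN and
    reduces Q-value overestimation.
    """
    greedy_actions = np.argmax(next_q_online, axis=1)                    # (batch,)
    next_values = next_q_target[np.arange(len(rewards)), greedy_actions]
    return rewards + gamma * (1.0 - dones) * next_values

# Toy usage with a batch of 3 transitions and 2 actions:
r = np.array([0.0, 1.0, -1.0])
d = np.array([0.0, 0.0, 1.0])                     # 1.0 marks a terminal transition
q_online = np.array([[0.1, 0.9], [0.5, 0.4], [0.2, 0.3]])
q_target = np.array([[0.0, 1.0], [0.6, 0.2], [0.7, 0.8]])
print(double_dqn_targets(r, d, q_online, q_target))  # [0.99, 1.594, -1.0]
```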