Calibrated Model-Based Deep Reinforcement Learning
Authors: Ali Malik, Volodymyr Kuleshov, Jiaming Song, Danny Nemer, Harlan Seymour, Stefano Ermon
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on benchmarks for contextual bandits and continuous control (Li et al., 2010; Todorov et al., 2012), as well as on a planning problem in inventory management (Van Roy et al., 1997). Our results show that calibration consistently improves the cumulative reward and the sample complexity of model-based agents, and also enhances their ability to balance exploration and exploitation in contextual bandit settings. |
| Researcher Affiliation | Collaboration | 1. Department of Computer Science, Stanford University, USA; 2. Afresh Technologies, San Francisco, USA. |
| Pseudocode | Yes | In Algorithm 1, we present a simple procedure that augments a model-based reinforcement learning algorithm with an extra step that ensures the calibration of its transition model. |
| Open Source Code | Yes | Our code is available at https://github.com/ermongroup/CalibratedModelBasedRL |
| Open Datasets | Yes | We evaluate the calibrated version (CalLinUCB) and uncalibrated version (LinUCB) of the LinUCB algorithm on both synthetic data that satisfies the linearity assumption of the algorithm, as well as on real UCI datasets from Li et al. (2010). We use the Corporación Favorita Kaggle dataset, which consists of historical sales from a supermarket chain in Ecuador. |
| Dataset Splits | Yes | We experiment on the 100 highest-selling items and use data from 2014-01-01 to 2016-05-31 for training and data from 2016-06-01 to 2016-08-31 for testing. |
| Hardware Specification | No | Insufficient information. The paper does not specify any hardware details such as CPU, GPU, or TPU models used for running the experiments. |
| Software Dependencies | No | Insufficient information. The paper mentions software environments like MuJoCo and OpenAI Gym, and models like Bayesian Dense Net, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The Bayesian Dense Net has five layers of 128 hidden units with a dropout rate of 0.5 and parametric ReLU nonlinearities. We use variational dropout (Gal & Ghahramani, 2016b) to compute probabilistic forecasts from the model. We follow the training procedure and hyperparameters in Chua et al. (2018), as described at https://github.com/kchua/handful-of-trials. |
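The calibration step referenced in the Pseudocode row (Algorithm 1 recalibrates the agent's transition model) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it fits an isotonic regression that maps the model's predicted CDF values on a held-out calibration set to their empirical frequencies, the standard recalibration recipe for probabilistic regression models. The function name `fit_recalibrator` and the toy miscalibrated data are our own assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(predicted_cdf_values):
    """Fit an isotonic map R so that R(F(y)) is empirically calibrated.

    predicted_cdf_values: array of F_t(y_t), the model's CDF evaluated
    at each observed outcome on a held-out calibration set.
    """
    p = np.sort(np.asarray(predicted_cdf_values))
    # Empirical frequency of outcomes with predicted CDF value <= p
    emp = np.arange(1, len(p) + 1) / len(p)
    recal = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    recal.fit(p, emp)
    return recal

# Toy example: an overconfident model whose CDF values cluster near 0 and 1
rng = np.random.default_rng(0)
cdf_vals = rng.beta(0.5, 0.5, size=1000)   # stand-in for miscalibrated F_t(y_t)
recal = fit_recalibrator(cdf_vals)

# After recalibration the adjusted values are approximately uniform,
# i.e. roughly 90% of them fall at or below the 0.9 confidence level.
adjusted = recal.predict(np.sort(cdf_vals))
```

At planning time, the agent would pass every predicted CDF value through `recal` before sampling or computing expectations, so that stated confidence levels match observed frequencies.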
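The Open Datasets row refers to the LinUCB contextual bandit baseline of Li et al. (2010). A minimal sketch of the disjoint-model variant is below; the class name, toy reward structure, and hyperparameters are illustrative assumptions, not the paper's code.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB (Li et al., 2010): one ridge regression
    per arm, with an upper-confidence exploration bonus."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # ridge Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # reward-weighted context sums

    def select(self, x):
        """Pick the arm maximizing theta^T x + alpha * sqrt(x^T A^-1 x)."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy run: arm 1's reward is linear in the first context feature, arm 0 pays 0
rng = np.random.default_rng(1)
bandit = LinUCB(n_arms=2, dim=3, alpha=0.5)
for _ in range(500):
    x = rng.normal(size=3)
    arm = bandit.select(x)
    reward = x[0] if arm == 1 else 0.0
    bandit.update(arm, x, reward)
```

After training, the bandit prefers arm 1 on contexts with a large positive first feature and arm 0 otherwise. The paper's contribution is to recalibrate the per-arm uncertainty estimates that enter the confidence bonus, giving the calibrated variant (CalLinUCB).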
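The Dataset Splits row describes a chronological train/test split of the Corporación Favorita sales data. A sketch of that split, assuming a pandas frame with a `date` column (the frame below is synthetic; only the cutoff dates come from the paper):

```python
import pandas as pd

# Synthetic stand-in covering the paper's date range (2014-01-01 to 2016-08-31)
dates = pd.date_range("2014-01-01", "2016-08-31", freq="D")
df = pd.DataFrame({"date": dates, "sales": range(len(dates))})

# Chronological split at the boundaries reported in the paper
train = df[(df["date"] >= "2014-01-01") & (df["date"] <= "2016-05-31")]
test = df[(df["date"] >= "2016-06-01") & (df["date"] <= "2016-08-31")]
```

A chronological split (rather than a random one) is the appropriate protocol here, since the task is forecasting future sales from past data.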
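The Experiment Setup row obtains probabilistic forecasts via variational (Monte Carlo) dropout: dropout is kept active at test time and repeated stochastic forward passes yield a predictive distribution. The sketch below uses a tiny untrained one-layer network as a stand-in for the paper's five-layer, 128-unit model; all weights and sizes are illustrative assumptions, and only the mechanism (inverted dropout left on at inference) follows Gal & Ghahramani (2016).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-hidden-layer network; weights are random stand-ins, not trained values
W1 = rng.normal(size=(8, 1)); b1 = np.zeros(8)
W2 = rng.normal(size=(1, 8)); b2 = np.zeros(1)

def forward(x, drop_rate=0.5):
    """One stochastic forward pass with dropout kept ON (MC dropout)."""
    h = np.maximum(W1 @ x + b1, 0.0)            # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate      # fresh Bernoulli dropout mask
    h = h * mask / (1.0 - drop_rate)            # inverted-dropout scaling
    return (W2 @ h + b2)[0]

# Repeated stochastic passes form an empirical predictive distribution
x = np.array([0.7])
samples = np.array([forward(x) for _ in range(2000)])
mu, sigma = samples.mean(), samples.std()       # probabilistic forecast
```

The resulting `(mu, sigma)` (or the sample set itself) is the kind of probabilistic transition forecast that Algorithm 1 then recalibrates.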