Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge
Authors: Meshal Alharbi, Mardavij Roozbehani, Munther Dahleh
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically compare our algorithm with multiple problem-agnostic Q-learning algorithms from the literature. In particular, we compare with the model-free UCB-H method of Jin et al. (2018) and the model-based UCBVI method of Azar, Osband, and Munos (2017). When we test Algorithm 1, we either provide it the true function f (to test ζ = 0) or we corrupt f with random noise (to test ζ > 0) before we give it to the algorithm. Inside a single run, the approximate function f̂ is kept fixed across the time T. Environments. We compare the algorithms on randomly generated MDPs with varying cardinalities. ... Results. In Figure 1, we plot the regret per episode (i.e., the suboptimality gap) for S = 25, A = {2, 4, 8}, H = {5, 10}, W = 5, ζ = {0, 2, 4}, L = 0.25, and k ∈ [5000]. Each curve is an average of 50 simulations, and the shadings represent a width of one standard deviation. (A hedged sketch of this noise-corruption step appears after the table.) |
| Researcher Affiliation | Academia | Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA {meshal, mardavij, dahleh}@mit.edu |
| Pseudocode | Yes | Algorithm 1: UCB-f Optimistic Q-learning with f̂ |
| Open Source Code | Yes | The code to reproduce these results is publicly available: https://github.com/meshal-h/ucb-f |
| Open Datasets | No | Environments. We compare the algorithms on randomly generated MDPs with varying cardinalities. |
| Dataset Splits | No | The paper states that experiments are conducted on "randomly generated MDPs" and describes the number of episodes (K) and steps (H). However, it does not specify traditional train/validation/test dataset splits as it involves online reinforcement learning where data is generated dynamically. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments. It only mentions general experimental setup. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries). |
| Experiment Setup | Yes | Bonuses. As multiplicative constants in the bonuses can drastically change empirical performance, we unify the bonus design across the different methods. Specifically, we use the bonuses of Eq. (17): both β1 and β2 depend on the visit counts N_h^k(s, a), and β2 carries an additional cζL term. UCB-H and UCBVI use β1, and our method uses β2. ... we choose to optimize c by setting it to c = 0.05. ... In Figure 1, we plot the regret per episode (i.e., the suboptimality gap) for S = 25, A = {2, 4, 8}, H = {5, 10}, W = 5, ζ = {0, 2, 4}, L = 0.25, and k ∈ [5000]. Each curve is an average of 50 simulations... (A hedged sketch of this bonus design also follows the table.) |
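
The Research Type row describes corrupting the true dynamics f with random noise to obtain an approximate model f̂ at a chosen misspecification level ζ, with f̂ held fixed within a run. The authors' actual implementation is in the linked repository; the sketch below is only one plausible reading for a tabular environment, where `f_true`, the integer `offset` table, and the clipping to [0, S-1] are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def make_corrupted_dynamics(f_true, S, A, W, zeta, seed=0):
    """Return an approximate dynamics f_hat built by corrupting
    f_true(s, a, w) -> s' with bounded random noise (illustrative sketch;
    the authors' corruption scheme may differ -- see the repository)."""
    rng = np.random.default_rng(seed)
    # Draw the perturbation once, so f_hat stays fixed for the whole run,
    # as stated in the paper.
    offset = rng.integers(-zeta, zeta + 1, size=(S, A, W))

    def f_hat(s, a, w):
        # Shift the true next state by at most `zeta` and keep it in range.
        return int(np.clip(f_true(s, a, w) + offset[s, a, w], 0, S - 1))

    return f_hat
```

With ζ = 0 the offset table is all zeros, so the algorithm receives the true f, matching the paper's exact-knowledge baseline.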
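
The Experiment Setup row reports a unified bonus design (Eq. 17) with constant c = 0.05, where the UCB-f bonus adds a cζL misspecification term on top of a count-based exploration term. The snippet below illustrates that structure only; the 1/√n shape and the absence of horizon and log factors are simplifying assumptions, not the paper's exact Eq. (17).

```python
import numpy as np

C = 0.05                      # multiplicative constant used for all methods
ZETA, LIPSCHITZ = 2.0, 0.25   # example values of zeta and L from Figure 1

def beta_1(n_visits: int) -> float:
    """Count-based bonus used by UCB-H and UCBVI (illustrative 1/sqrt(n) shape)."""
    return C * np.sqrt(1.0 / max(n_visits, 1))

def beta_2(n_visits: int, zeta: float = ZETA, lip: float = LIPSCHITZ) -> float:
    """Bonus used by UCB-f: the same exploration term plus a c*zeta*L
    misspecification term, mirroring the structure of the paper's Eq. (17)."""
    return beta_1(n_visits) + C * zeta * lip
```

Note that these simplified forms make β1 and β2 coincide at ζ = 0; that is an artifact of dropping the horizon-dependent factors of Eq. (17), not a property claimed by the paper.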