Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge

Authors: Meshal Alharbi, Mardavij Roozbehani, Munther Dahleh

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically compare our algorithm with multiple problem-agnostic Q-learning algorithms from the literature. In particular, we compare with the model-free UCB-H method of Jin et al. (2018) and the model-based UCBVI method of Azar, Osband, and Munos (2017). When we test Algorithm 1, we either provide it the true function f (to test ζ = 0) or we corrupt f with random noise (to test ζ > 0) before we give it to the algorithm. Inside a single run, the approximate function f̂ is kept fixed across the time T. Environments. We compare the algorithms on randomly generated MDPs with varying cardinalities. ... Results. In Figure 1, we plot the regret per episode (i.e., the suboptimality gap) for S = 25, A = {2, 4, 8}, H = {5, 10}, W = 5, ζ = {0, 2, 4}, L = 0.25, and K ∈ [5000]. Each curve is an average of 50 simulations, and the shadings represent a width of one standard deviation.
Researcher Affiliation | Academia | Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. {meshal, mardavij, dahleh}@mit.edu
Pseudocode | Yes | Algorithm 1: UCB-f Optimistic Q-learning with f̂
Open Source Code | Yes | The code to reproduce these results is publicly available at https://github.com/meshal-h/ucb-f
Open Datasets | No | Environments. We compare the algorithms on randomly generated MDPs with varying cardinalities.
Dataset Splits | No | The paper states that experiments are conducted on "randomly generated MDPs" and describes the number of episodes (K) and steps (H). However, it does not specify traditional train/validation/test dataset splits, as the experiments involve online reinforcement learning where data is generated dynamically.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments. It only mentions the general experimental setup.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries).
Experiment Setup | Yes | Bonuses. As multiplicative constants in the bonuses can drastically change empirical performances, we unify the bonus design across the different methods. Specifically, we use the following bonuses in the empirical simulations: β1 = c/√(N_h^k(s, a)) and β2 = c/√k + cζL (17). UCB-H and UCBVI use β1, and our method uses β2. ... we choose to optimize c by setting it to c = 0.05. ... In Figure 1, we plot the regret per episode (i.e., the suboptimality gap) for S = 25, A = {2, 4, 8}, H = {5, 10}, W = 5, ζ = {0, 2, 4}, L = 0.25, and K ∈ [5000]. Each curve is an average of 50 simulations...
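
The "Research Type" excerpt above outlines the evaluation protocol: Algorithm 1 receives either the true dynamics function f (ζ = 0) or a noise-corrupted copy f̂ (ζ > 0) that is kept fixed within a run, and regret per episode is averaged over 50 simulations on randomly generated MDPs. The Python sketch below illustrates such a harness under stated assumptions; the tabular environment representation, the corruption scheme, and the helper names (make_random_mdp, corrupt, average_regret_curves, run_agent) are illustrative stand-ins, not the authors' released code.

    import numpy as np

    def make_random_mdp(S, A, H, rng):
        """Hypothetical stand-in for the paper's randomly generated MDPs:
        Dirichlet transition kernels and uniform mean rewards."""
        P = rng.dirichlet(np.ones(S), size=(H, S, A))  # P[h, s, a] -> distribution over next states
        R = rng.uniform(size=(H, S, A))                # mean rewards in [0, 1]
        return P, R

    def corrupt(P, zeta, L, rng):
        """Assumed noise corruption of the known dynamics (the paper corrupts f itself;
        a transition kernel stands in for it here). Kept fixed within a single run."""
        P_hat = P + zeta * L * rng.uniform(size=P.shape)
        return P_hat / P_hat.sum(axis=-1, keepdims=True)

    def average_regret_curves(run_agent, S=25, A=4, H=5, K=5000, zeta=0, L=0.25, n_seeds=50):
        """Average regret-per-episode curves over repeated simulations, as in Figure 1.
        run_agent(P, R, P_hat, K) is a hypothetical callable returning the K per-episode
        suboptimality gaps of one run."""
        curves = []
        for seed in range(n_seeds):
            rng = np.random.default_rng(seed)
            P, R = make_random_mdp(S, A, H, rng)
            P_hat = corrupt(P, zeta, L, rng)
            curves.append(run_agent(P, R, P_hat, K))
        curves = np.asarray(curves)
        return curves.mean(axis=0), curves.std(axis=0)  # mean curve and one-std shading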
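
The "Pseudocode" row refers to the paper's Algorithm 1 (UCB-f, optimistic Q-learning with f̂), which is not reproduced here. As a point of reference, the sketch below implements one episode of the problem-agnostic UCB-H baseline of Jin et al. (2018) that the experiments compare against, with the standard step size α_t = (H + 1)/(H + t) and a generic bonus; how Algorithm 1 additionally exploits the approximate dynamics f̂ is specified in the paper. The interface (env_step, reset, bonus) is an assumption for illustration.

    import numpy as np

    def ucb_h_episode(env_step, reset, Q, V, N, H, bonus):
        """One episode of UCB-H-style optimistic Q-learning (Jin et al. 2018).

        Q[h][s, a]: optimistic value estimates (initialized to H)
        V[h][s]:    state values, with V[H][s] = 0
        N[h][s, a]: visit counts
        bonus(t):   exploration bonus as a function of the visit count t
        env_step(h, s, a) -> (reward, next_state); reset() -> initial state.
        These callables are assumed for illustration, not the paper's interface.
        """
        s = reset()
        for h in range(H):
            a = int(np.argmax(Q[h][s]))                # act greedily w.r.t. optimistic Q
            r, s_next = env_step(h, s, a)
            N[h][s, a] += 1
            t = N[h][s, a]
            alpha = (H + 1) / (H + t)                  # step size from Jin et al. (2018)
            target = r + V[h + 1][s_next] + bonus(t)   # optimistic one-step target
            Q[h][s, a] = (1 - alpha) * Q[h][s, a] + alpha * target
            V[h][s] = min(H, Q[h][s].max())            # clip at the maximum return H
            s = s_next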
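
Finally, the "Experiment Setup" row quotes the unified bonus design of Eq. (17) with the tuned constant c = 0.05. The excerpt is only partially legible, so the sketch below takes the simplest count-based reading, absorbing any horizon-dependent numerator into c; the exact expressions should be checked against the paper and the released code.

    import math

    C = 0.05  # multiplicative constant, unified across methods and tuned in the paper

    def beta1(n_visits: int, c: float = C) -> float:
        """Bonus used by UCB-H and UCBVI: shrinks with the visit count N_h^k(s, a)."""
        return c / math.sqrt(max(n_visits, 1))

    def beta2(k: int, zeta: float, L: float, c: float = C) -> float:
        """Bonus used by the paper's method: shrinks with the episode index k and
        carries an extra c * zeta * L term for the dynamics misspecification."""
        return c / math.sqrt(max(k, 1)) + c * zeta * L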