Fast Rates for Maximum Entropy Exploration

Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Éric Moulines, Rémi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Ménard

ICML 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In this section we report experimental results on a simple tabular MDP for the presented algorithms and show the difference between visitation and trajectory entropies. In particular, we compare the EntGame and UCBVI-Ent algorithms with (a) a random agent that takes all actions uniformly at random, (b) an optimal MVEE policy computed by solving the convex program, and (c) an optimal MTEE policy computed by solving the regularized Bellman equations. In Figure 1 we present the number of state visits for our algorithms and baselines during N = 100000 interactions with the environment.
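For context, the MTEE baseline quoted above is obtained by backward induction with entropy-regularized (soft) Bellman backups. The following is a minimal numpy sketch of one standard form of such backups, assuming a known stage-homogeneous kernel p[s, a, s'] and a finite horizon; it uses the usual decomposition of trajectory entropy into policy entropy plus next-state transition entropy, and all function and variable names are ours rather than the paper's.

    import numpy as np

    def mtee_policies(p, horizon):
        """Backward induction with entropy-regularized (soft) Bellman backups.

        p       : (S, A, S) stage-homogeneous transition kernel
        horizon : number of stages H
        Returns a list [pi_1, ..., pi_H] of (S, A) stochastic policies.
        """
        S, A, _ = p.shape
        # Entropy of the next-state distribution acts as the per-step
        # reward under the decomposition of trajectory entropy.
        safe_p = np.where(p > 0, p, 1.0)            # avoid log(0)
        trans_ent = -np.sum(np.where(p > 0, p * np.log(safe_p), 0.0), axis=2)
        V = np.zeros(S)                             # terminal value V_{H+1} = 0
        policies = []
        for _ in range(horizon):                    # stages H, H-1, ..., 1
            Q = trans_ent + p @ V                   # Q_h(s,a) = H(p(.|s,a)) + E[V_{h+1}(s')]
            m = Q.max(axis=1, keepdims=True)        # stabilized log-sum-exp
            V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
            policies.append(np.exp(Q - V[:, None])) # softmax policy over actions
        return policies[::-1]

The MVEE baseline is different in kind: it maximizes the entropy of the state-visitation distribution, which leads to a concave program over visitation frequencies (the convex program mentioned above) rather than a Bellman recursion.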
Researcher Affiliation Collaboration HSE University; Artificial Intelligence Research Institute; Duisburg-Essen University; Google DeepMind; École Polytechnique; Mohamed Bin Zayed University of AI; IDEMIA; ENS Lyon.
Pseudocode Yes Algorithm 1 (EntGame), Algorithm 2 (RL-Explore-Ent), Algorithm 3 (UCBVI-Ent), Algorithm 4 (RegEntGame)
Open Source Code Yes The code for the experiments can be found at the following link: https://github.com/d-tiapkin/max-entropy-exploration.
Open Datasets Yes We choose a stochastic environment called Double Chain, as considered by Kaufmann et al. (2021). To further verify our findings, we perform additional experiments on the Grid World environment presented in Figure 4.
Dataset Splits No The paper describes simulation environments ('Double Chain', 'Grid World') and the total number of samples/interactions (e.g., 'N = 100000 samples') for learning and evaluation, but it does not specify explicit training, validation, or test dataset splits in percentages or counts.
Hardware Specification No The paper does not specify any hardware details such as GPU or CPU models used for running the experiments.
Software Dependencies No The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup Yes We choose a stochastic environment called Double Chain, as considered by Kaufmann et al. (2021). Since the transition kernel for this environment is stage-homogeneous... In Figure 1 we present the number of state visits for our algorithms and baselines during N = 100000 interactions with the environment. For the UCBVI-Ent algorithm the procedure was separated into two stages: first, we learn the MDP with an N-sample budget and extract the final policy; we then plot the number of visits for the final policy during another N samples. The state space is a set of discrete points in a 21x21 grid. For each state there are 4 possible actions: left, right, up, or down, and for each action there is a 5% probability of moving in a wrong direction. The initial state s1 is the middle of the grid. For this experiment we use N = 60000 samples and report the average over 12 random seeds.
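To make the setup concrete, below is a minimal sketch of the Grid World described above: a 21x21 grid, four actions, a 5% chance of moving in a wrong direction, and the initial state in the middle of the grid. The class name, the exact slip model (uniform over the three other directions), and the clipping at the walls are our assumptions for illustration, not details confirmed by the paper.

    import numpy as np

    class GridWorld:
        """Illustrative 21x21 grid world with slippery actions."""

        MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # left, right, up, down

        def __init__(self, size=21, slip=0.05, seed=0):
            self.size, self.slip = size, slip
            self.rng = np.random.default_rng(seed)
            self.reset()

        def reset(self):
            # Initial state s1 is the middle of the grid.
            self.pos = (self.size // 2, self.size // 2)
            return self.pos

        def step(self, action):
            # With probability `slip`, move in a wrong direction
            # (assumed uniform over the other three; the paper does not say).
            if self.rng.random() < self.slip:
                action = int(self.rng.choice([a for a in range(4) if a != action]))
            dx, dy = self.MOVES[action]
            # Moves into a wall keep the agent in place (assumption).
            x = min(max(self.pos[0] + dx, 0), self.size - 1)
            y = min(max(self.pos[1] + dy, 0), self.size - 1)
            self.pos = (x, y)
            return self.pos

Accumulating a visit count per (x, y) while repeatedly calling step under a fixed policy yields the kind of state-visitation tallies the experiments report.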