The Importance of Non-Markovianity in Maximum State Entropy Exploration

Authors: Mirco Mutti, Riccardo De Santi, Marcello Restelli

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Despite the hardness result of Theorem 5.4, we provide a brief numerical validation around the potential of non-Markovianity in MSE exploration. Crucially, the reported analysis is limited to simple domains and short time horizons, and it has to be intended as an illustration of the theoretical claims reported in previous sections. For the sake of simplicity, in this analysis we consider stationary policies for the Markovian set, though similar results can be obtained for time-variant strategies as well (in stochastic environments). Whereas a comprehensive evaluation of the practical benefits of non-Markovianity in MSE exploration is left as future work, we discuss in Section 7 why we believe that the development of scalable methods is not hopeless even in this challenging setting. In this section, we consider a 3State (S = 3, A = 2, T = 9), which is a simple abstraction of the two-rooms in Figure 1, and a River Swim (Strehl & Littman, 2008) (S = 3, A = 2, T = 10), which are depicted in Figures 2a and 2d, respectively. Especially, we compare the expected entropy (2) achieved by an optimal non-Markovian policy π_NM ∈ arg max_{π ∈ Π_NM} E(π), which is obtained by solving the extended MDP as described in the proof of Lemma 4.4, against an optimal Markovian policy π_M ∈ arg max_{π ∈ Π_M} E(π). In confirmation of the result in Theorem 4.2, π_M cannot match the performance of π_NM (see Figures 2b, 2e). In 3State, an optimal strategy requires going left when arriving in state 0 from state 2 and vice versa. The policy π_NM is able to do that, and it always realizes the optimal trajectory (Figure 2c). Instead, π_M is uniform in 0 and it often runs into sub-optimal trajectories. In the River Swim, the main hurdle is to reach state 2 from the initial one. Whereas π_M and π_NM are equivalently good in doing so, as reported in Figure 2f, only the non-Markovian strategy is able to balance the visitations in the previous states when it eventually reaches 2. The difference is already noticeable with a short horizon and it would further increase with a longer T. (A minimal sketch of the entropy objective being estimated here is given after this table.)
Researcher Affiliation | Academia | 1 Politecnico di Milano, 2 Università di Bologna, 3 ETH Zurich.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There are no links or explicit statements about code release.
Open Datasets | Yes | In this section, we consider a 3State (S = 3, A = 2, T = 9), which is a simple abstraction of the two-rooms in Figure 1, and a River Swim (Strehl & Littman, 2008) (S = 3, A = 2, T = 10), which are depicted in Figures 2a and 2d, respectively.
Dataset Splits | No | The paper performs numerical validation but does not provide specific dataset split information (e.g., percentages, sample counts, or citations to predefined splits) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., programming language versions, library versions, or solver versions).
Experiment Setup | Yes | In this section, we consider a 3State (S = 3, A = 2, T = 9), which is a simple abstraction of the two-rooms in Figure 1, and a River Swim (Strehl & Littman, 2008) (S = 3, A = 2, T = 10), which are depicted in Figures 2a and 2d, respectively. Especially, we compare the expected entropy (2) achieved by an optimal non-Markovian policy π_NM ∈ arg max_{π ∈ Π_NM} E(π), which is obtained by solving the extended MDP as described in the proof of Lemma 4.4, against an optimal Markovian policy π_M ∈ arg max_{π ∈ Π_M} E(π)... We provide 95% c.i. over 100 runs. (A sketch of a standard 95% confidence-interval computation is given after this table.)
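
The rows above compare policies by their expected state-visitation entropy E(π). As a minimal sketch, and not code from the paper, the snippet below shows one way such a quantity can be estimated by Monte Carlo rollouts. The `env_step` transition function and the policies it is used with are hypothetical placeholders rather than the paper's 3State or River Swim dynamics; the structural point it illustrates is that a policy receiving the full state history can express the non-Markovian behaviour described above (e.g., going left in state 0 when it was reached from state 2), whereas a Markovian policy ignores everything but the last state.

```python
import numpy as np

def visitation_entropy(states, n_states):
    # Entropy of the empirical distribution of states visited along one trajectory.
    counts = np.bincount(states, minlength=n_states).astype(float)
    p = counts / counts.sum()
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def estimate_expected_entropy(policy, env_step, s0, horizon, n_states,
                              n_runs=100, seed=0):
    # Returns the per-run trajectory entropies; their mean estimates E(pi).
    # `policy(history, rng)` sees the whole state history, so it can be
    # non-Markovian; a Markovian policy would look only at history[-1].
    # `env_step(state, action, rng)` is a hypothetical next-state sampler.
    rng = np.random.default_rng(seed)
    per_run = []
    for _ in range(n_runs):
        history = [s0]
        for _ in range(horizon):
            action = policy(history, rng)
            history.append(env_step(history[-1], action, rng))
        per_run.append(visitation_entropy(np.array(history), n_states))
    return np.array(per_run)
```

For instance, a history-dependent rule for the 3State abstraction could inspect history[-2] to tell whether state 0 was entered from state 2, which is exactly the information a stationary Markovian policy cannot use.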
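The setup row also reports 95% confidence intervals over 100 runs. The paper does not spell out the interval construction, so, as an assumption, a standard normal-approximation interval around the mean per-run entropy could be computed as follows:

```python
import numpy as np

def mean_with_ci95(per_run_values):
    # Assumed normal-approximation 95% confidence interval over independent runs.
    x = np.asarray(per_run_values, dtype=float)
    mean = float(x.mean())
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return mean, (mean - half_width, mean + half_width)
```

Applied to the array returned by `estimate_expected_entropy` with `n_runs=100`, this yields the kind of mean with 95% c.i. summary quoted above.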