Simultaneously Updating All Persistence Values in Reinforcement Learning

Authors: Luca Sabbioni, Luca Al Daire, Lorenzo Bisi, Alberto Maria Metelli, Marcello Restelli

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | After providing a study on the effects of persistence, we experimentally evaluate our approach in both tabular contexts and more challenging frameworks, including some Atari games.
Researcher Affiliation | Collaboration | Luca Sabbioni (1), Luca Al Daire (1), Lorenzo Bisi (2), Alberto Maria Metelli (1), Marcello Restelli (1); (1) Politecnico di Milano, Milan, Italy; (2) ML cube, Milan, Italy
Pseudocode | Yes | Algorithm 1: All Persistence Bellman Update; Algorithm 2: Persistent Q-learning (Per Q-learning); Algorithm 3: Multiple Replay Buffer Storing
Open Source Code | No | The paper provides a link to its full version on arXiv, but it does not contain an explicit statement about releasing its source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | We start with the deterministic 6x10 grid-worlds introduced by Biedenkapp et al. (2021). [...] Moreover, we experiment with the 16x16 Frozen Lake from the OpenAI Gym benchmark (Brockman et al. 2016), [...] We start with Mountain Car (Moore 1991), [...] The algorithm is then tested in the challenging framework of Atari 2600 games. (See the environment sketch below the table.)
Dataset Splits | No | The paper describes the use of replay buffers for training in reinforcement learning environments but does not specify a static train/validation/test split for reproducibility of data partitioning.
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using the 'Open AI Gym (Brockman et al. 2016) and Baselines (Dhariwal et al. 2017) Python toolkits' but does not specify version numbers for these or other software components.
Experiment Setup | Yes | The sub-transitions are stored in multiple replay buffers D_k, one for each persistence k ∈ K. Specifically, D_k stores tuples in the form (s, a_t, s', r, κ), as summarized in Algorithm 3, where s and s' are the first and the last state of the sub-transition, r is the κ-persistent reward, and κ is the true length of the sub-transition, which will then be used to suitably apply Ĥ^κ_t. Finally, the gradient update is computed by sampling a mini-batch of experience tuples from each replay buffer D_k, in equal proportion. For a fair comparison with TempoRL and standard DQN, persistence is implemented on top of the frame skip. Thus, a one-step transition corresponds to 4 frame skips. (A storage and sampling sketch follows the table.)
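
The environments quoted under Open Datasets are standard benchmarks, so a minimal setup sketch can be given with the OpenAI Gym toolkit cited under Software Dependencies. The environment IDs, the Gym version they assume, and the choice of Atari title below are assumptions on our part; the paper does not pin any of them, and the 6x10 grid-worlds come from the TempoRL code of Biedenkapp et al. (2021) rather than the Gym registry.

```python
# Environment-setup sketch (assumed Gym IDs; recent Gym releases register
# "FrozenLake-v1", older ones "FrozenLake-v0"). Atari titles require the
# `gym[atari]` extra.
import gym
from gym.envs.toy_text.frozen_lake import generate_random_map

# 16x16 Frozen Lake: the Gym registry only ships 4x4 and 8x8 maps, so a
# 16x16 layout must be passed explicitly (how the authors built theirs is
# not stated; a random map is used here purely for illustration).
frozen_lake = gym.make("FrozenLake-v1", desc=generate_random_map(size=16))

# Mountain Car (Moore 1991).
mountain_car = gym.make("MountainCar-v0")

# One Atari 2600 game as an example; the "NoFrameskip" variant lets the
# agent apply its own 4-frame skip, consistent with the Experiment Setup row.
breakout = gym.make("BreakoutNoFrameskip-v4")

# The deterministic 6x10 grid-worlds are custom environments from the
# TempoRL codebase of Biedenkapp et al. (2021), not part of the Gym registry.
```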
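
The Experiment Setup row describes per-persistence replay buffers and equal-proportion mini-batch sampling. Below is a minimal, hypothetical sketch of that bookkeeping in Python; the class and function names, the buffer capacity, and the discounted-sum definition of the κ-persistent reward are assumptions for illustration, not the authors' implementation (none is released, per the Open Source Code row).

```python
import random
from collections import deque

def persistent_reward(rewards, gamma=0.99):
    """Discounted sum of the per-step rewards collected along a
    sub-transition (assumed definition of the kappa-persistent reward)."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

class MultiPersistenceReplay:
    """One FIFO buffer D_k per persistence k in K, each storing
    sub-transitions (s, a, s_next, r, kappa). Hypothetical sketch of the
    storage scheme summarized as Algorithm 3 in the paper."""

    def __init__(self, persistences, capacity=100_000):
        self.buffers = {k: deque(maxlen=capacity) for k in persistences}

    def store(self, k, s, a, s_next, r, kappa):
        # kappa is the true length of the sub-transition; it can be shorter
        # than k, e.g. when the episode terminates early.
        self.buffers[k].append((s, a, s_next, r, kappa))

    def sample(self, batch_size):
        # Equal-proportion mini-batch: draw (roughly) batch_size / |K|
        # tuples from each buffer D_k.
        per_buffer = max(1, batch_size // len(self.buffers))
        batch = []
        for buf in self.buffers.values():
            batch.extend(random.sample(buf, min(per_buffer, len(buf))))
        return batch

# Usage sketch: persist an action for k environment steps (each step already
# covering 4 skipped frames, as stated above), then store and sample.
replay = MultiPersistenceReplay(persistences=[1, 2, 4, 8])
# replay.store(k, s, a, s_next, persistent_reward(step_rewards), kappa=len(step_rewards))
# batch = replay.sample(batch_size=32)
```

Under these assumptions, a single gradient update sees the same number of tuples from every persistence value, which is one way to realize the "equal proportion" sampling the row describes.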