Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees

Authors: Daqian Shao, Marta Kwiatkowska

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Several experiments on various tabular MDP environments across different LTL tasks demonstrate the improved sample efficiency and optimal policy convergence.
Researcher Affiliation | Academia | Daqian Shao and Marta Kwiatkowska, Department of Computer Science, University of Oxford, UK; {daqian.shao, marta.kwiatkowska}@cs.ox.ac.uk
Pseudocode | Yes | Algorithm 1: KC Q-learning from LTL; Algorithm 2: CF+KC Q-learning from LTL
Open Source Code | Yes | The implementation of our algorithms and experiments can be found on GitHub: https://github.com/shaodaqian/rl-from-ltl
Open Datasets | Yes | The second MDP environment is the 8×8 frozen lake environment from OpenAI Gym [Brockman et al., 2016].
Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits (e.g., percentages or sample counts), as is typical for static datasets. As a reinforcement learning paper, it describes training steps and episodes within environments rather than partitioning a fixed dataset.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions several tools used (PRISM, Rabinizer 4, OpenAI Gym) and Q-learning as the core method, but does not provide specific version numbers for the software dependencies used in their experiments.
Experiment Setup | Yes | We set the learning rate α = 0.1 and ϵ = 0.1 for exploration. We also set a relatively loose upper bound on rewards U = 0.1 and discount factor γ = 0.99 for all experiments to ensure optimality. [...] for experiments we opt for a specific reward function that linearly increases the reward for accepting states as the value of K increases, namely r_n = Un/K for n ∈ [0..K]. The Q function is optimistically initialized by setting the Q value for all available state-action pairs to 2U. All experiments are run 100 times, where we plot the average satisfaction probability with half standard deviation in the shaded area.
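For a concrete picture of the reported setup, below is a minimal, hypothetical Python sketch that wires the stated hyperparameters (α = 0.1, ϵ = 0.1, U = 0.1, γ = 0.99, reward r_n = Un/K, optimistic initialization to 2U) into a tabular Q-learning loop on the FrozenLake 8×8 environment. It is not the authors' implementation: the environment ID FrozenLake8x8-v1, the counter horizon K, and using the goal tile as a stand-in for an accepting state are assumptions, and the paper's KC and CF+KC algorithms additionally track an LDBA automaton state in a product MDP and perform counterfactual updates, which are omitted here.

```python
# Minimal, hypothetical sketch of the reported setup; not the authors' code.
# The LDBA product construction and counterfactual (CF) updates of Algorithms 1-2
# are omitted; reaching the goal tile stands in for visiting an accepting state.
import numpy as np
import gymnasium as gym

ALPHA, EPSILON = 0.1, 0.1   # learning rate and epsilon-greedy exploration (as reported)
U, GAMMA = 0.1, 0.99        # reward upper bound and discount factor (as reported)
K = 10                      # counter horizon; this value is an assumption, not from the paper

def reward(n: int) -> float:
    """Counter-dependent reward r_n = U * n / K for n in [0..K]."""
    return U * n / K

env = gym.make("FrozenLake8x8-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n

# Optimistic initialization: every available state-action value starts at 2U.
Q = np.full((n_states, n_actions), 2 * U)

def epsilon_greedy(state: int) -> int:
    if np.random.rand() < EPSILON:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    n = 0  # accepting-visit counter (placeholder for the paper's LDBA-based counter)
    while not done:
        action = epsilon_greedy(state)
        next_state, env_reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        r = 0.0
        if env_reward > 0:          # goal reached: stand-in for an accepting visit
            n = min(n + 1, K)
            r = reward(n)           # reward grows linearly with the counter, capped at U
        target = r + GAMMA * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += ALPHA * (target - Q[state, action])
        state = next_state
```

In the paper's setting the reward depends on visits to accepting states of the LDBA built from the LTL formula (via Rabinizer 4), not on the FrozenLake goal; the sketch only illustrates how the quoted hyperparameters, reward schedule, and optimistic initialization fit together in standard tabular Q-learning.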