Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees
Authors: Daqian Shao, Marta Kwiatkowska
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Several experiments on various tabular MDP environments across different LTL tasks demonstrate the improved sample efficiency and optimal policy convergence. |
| Researcher Affiliation | Academia | Daqian Shao and Marta Kwiatkowska Department of Computer Science, University of Oxford, UK {daqian.shao, marta.kwiatkowska}@cs.ox.ac.uk |
| Pseudocode | Yes | Algorithm 1: KC Q-learning from LTL; Algorithm 2: CF+KC Q-learning from LTL (an illustrative sketch of the Q-learning core appears after this table). |
| Open Source Code | Yes | The implementation of our algorithms and experiments can be found on GitHub: https://github.com/shaodaqian/rl-from-ltl |
| Open Datasets | Yes | The second MDP environment is the 8×8 frozen lake environment from OpenAI Gym [Brockman et al., 2016]. |
| Dataset Splits | No | The paper does not provide training/test/validation dataset splits (e.g., percentages or sample counts), as would be expected for static datasets. As a reinforcement learning paper, it describes training steps and episodes within environments rather than partitioning a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions the tools it uses (PRISM, Rabinizer 4, OpenAI Gym) and Q-learning as the core method, but does not provide version numbers for the software dependencies used in the experiments. |
| Experiment Setup | Yes | We set the learning rate α = 0.1 and ϵ = 0.1 for exploration. We also set a relatively loose upper bound on rewards U = 0.1 and discount factor γ = 0.99 for all experiments to ensure optimality. [...] for experiments we opt for a specific reward function that linearly increases the reward for accepting states as the value of K increases, namely rₙ = U·n/K for n ∈ [0..K]. The Q function is optimistically initialized by setting the Q value for all available state-action pairs to 2U. All experiments are run 100 times, where we plot the average satisfaction probability with half standard deviation in the shaded area. |
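The quoted experiment setup pins down a handful of concrete quantities. Below is a minimal Python sketch of the reward schedule rₙ = U·n/K and the optimistic Q-table initialization; it illustrates only the quoted hyperparameters, not the authors' implementation, and the helper names (`accepting_reward`, `init_q_table`) are hypothetical.

```python
import numpy as np

# Hyperparameters as quoted from the paper's experiment setup.
ALPHA = 0.1    # learning rate alpha
EPSILON = 0.1  # epsilon for epsilon-greedy exploration
U = 0.1        # loose upper bound on rewards
GAMMA = 0.99   # discount factor


def accepting_reward(n: int, K: int) -> float:
    """Reward for accepting states, r_n = U * n / K for n in [0..K]:
    increases linearly in n up to the upper bound U."""
    assert 0 <= n <= K
    return U * n / K


def init_q_table(n_states: int, n_actions: int) -> np.ndarray:
    """Optimistic initialization: the Q value of every available
    state-action pair is set to 2U, as stated in the setup."""
    return np.full((n_states, n_actions), 2 * U)
```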
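For the pseudocode row, Algorithms 1 and 2 (KC and CF+KC Q-learning from LTL) operate on a product of the MDP with an automaton derived from the LTL formula, with counting and counterfactual updates; the paper's GitHub repository contains the actual implementation. As a stripped-down illustration only, the sketch below shows the plain ε-greedy tabular Q-learning core on the cited 8×8 frozen lake environment with the quoted hyperparameters. It omits the LTL/automaton product entirely, and it assumes the Gym ≥ 0.26 reset/step API and the `FrozenLake8x8-v1` environment ID.

```python
import gym
import numpy as np

ALPHA, EPSILON, U, GAMMA = 0.1, 0.1, 0.1, 0.99  # quoted hyperparameters
rng = np.random.default_rng(0)

# 8x8 frozen lake environment from OpenAI Gym [Brockman et al., 2016];
# the environment ID assumes Gym >= 0.26.
env = gym.make("FrozenLake8x8-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.full((n_states, n_actions), 2 * U)  # optimistic initialization

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if rng.random() < EPSILON:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Standard Q-learning update; bootstrap only from
        # non-terminal successor states.
        target = reward + GAMMA * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += ALPHA * (target - Q[state, action])
        state = next_state
```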