User-Interactive Offline Reinforcement Learning

Authors: Phillip Swazinna, Steffen Udluft, Thomas Runkler

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as stopping at any time when the policy deteriorates or the behavior is too far from the familiar one. ... We show how such an algorithm can be designed, as well as compare its performance with a variety of offline RL baselines and show that a user can achieve state of the art performance with it." (A sketch of such a runtime-tunable policy follows the table.)
Researcher Affiliation | Collaboration | Phillip Swazinna (Siemens & TU Munich, Munich, Germany, swazinna@in.tum.de); Steffen Udluft (Siemens Technology, Munich, Germany, steffen.udluft@siemens.com); Thomas Runkler (Siemens & TU Munich, Munich, Germany, thomas.runkler@siemens.com)
Pseudocode | Yes | "Algorithm 1 LION (Training) ... return πθ;" (A sketch of the corresponding training objective follows the table.)
Open Source Code | Yes | "Code will be made available at https://github.com/pswazinna/LION."
Open Datasets | Yes | "Datasets: We evaluate LION on the industrial benchmark datasets initially proposed in (Swazinna et al., 2021b). ... The datasets are available at https://github.com/siemens/industrialbenchmark/tree/offline_datasets/datasets under the Apache License 2.0."
Dataset Splits | Yes | "We train dynamics models using a 90/10 random data split and select the best models according to their validation performance." (A split sketch follows the table.)
Hardware Specification | Yes | "We conducted experiments on a system with a Xeon Gold 5122 CPU (4 × 3.6 GHz, no GPU support used)."
Software Dependencies | No | The paper mentions "adam (Kingma & Ba, 2014)" as the optimizer but does not specify software or library names with version numbers for reproducibility.
Experiment Setup | Yes | "The recurrent models for the industrial benchmark have an RNN cell with size 30 and an output layer mapping from the cell state to the state space. We use G = 30 history steps to build up the hidden state of the RNN and then predict F = 50 steps into the future. The feedforward models of the 2D env have two layers of size 20 & 10. We use ReLU nonlinearities throughout all experiments. ... We use a discount factor of γ = 0.97 and perform rollouts of length H = 100 to train and evaluate the policy." (A dynamics-model sketch follows the table.)
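
Runtime-tunable policy (Research Type row). The quoted contribution is a policy whose trade-off hyperparameter is set by the user at deployment time. The following is a minimal sketch of that idea, assuming a PyTorch-style implementation in which λ is simply appended to the state; all class and variable names here (LambdaConditionedPolicy, hidden sizes, Tanh-bounded actions) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class LambdaConditionedPolicy(nn.Module):
    """Deterministic policy that receives the user's trade-off lambda as an extra input."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor, lam: float) -> torch.Tensor:
        # Concatenate the scalar lambda to every state in the batch.
        lam_t = torch.full((state.shape[0], 1), lam)
        return self.net(torch.cat([state, lam_t], dim=-1))

# At runtime the user starts at lam = 0.0 (stay close to the original behavior)
# and only increases it while the observed performance remains acceptable.
policy = LambdaConditionedPolicy(state_dim=30, action_dim=3)
state = torch.randn(1, 30)
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    action = policy(state, lam)  # inspect / evaluate behavior at each setting
```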
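Training objective (Pseudocode row). The table only excerpts the header and return statement of Algorithm 1, so the sketch below reconstructs the general shape of a λ-conditioned training step: sample λ, penalize deviation from the behavior policy, and maximize the return predicted by a learned dynamics model over rollouts of length H = 100 with γ = 0.97. The λ-sampling distribution, the proximity penalty, and the helper names (behavior_model, dynamics_model, lion_training_step) are assumptions; see the paper's Algorithm 1 for the precise recipe.

```python
import torch

def lion_training_step(policy, behavior_model, dynamics_model, batch_states,
                       optimizer, horizon=100, gamma=0.97):
    # Sample one trade-off value per state (uniform here; the paper may differ).
    lam = torch.rand(batch_states.shape[0], 1)

    # (1) Proximity term: stay close to the actions of the (cloned) behavior policy.
    with torch.no_grad():
        behavior_actions = behavior_model(batch_states)
    actions = policy(batch_states, lam)
    proximity_loss = ((actions - behavior_actions) ** 2).mean(dim=-1, keepdim=True)

    # (2) Return term: discounted return predicted by the learned dynamics model.
    ret = torch.zeros_like(lam)
    s = batch_states
    for t in range(horizon):
        a = policy(s, lam)
        s, r = dynamics_model(s, a)        # assumed to return (next_state, reward)
        ret = ret + (gamma ** t) * r

    # (3) Interpolate: lambda = 0 -> pure imitation, lambda = 1 -> pure return maximization.
    loss = ((1.0 - lam) * proximity_loss - lam * ret).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```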
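Data split (Dataset Splits row). A minimal sketch of the reported 90/10 random split and validation-based model selection; the function names and the representation of transitions as a single array are illustrative assumptions.

```python
import numpy as np

def split_90_10(transitions: np.ndarray, seed: int = 0):
    """Randomly split transitions into 90% training and 10% validation data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    cut = int(0.9 * len(transitions))
    return transitions[idx[:cut]], transitions[idx[cut:]]

def select_best_model(models, validation_loss_fn, val_data):
    """Keep the dynamics model with the lowest validation loss."""
    losses = [validation_loss_fn(m, val_data) for m in models]
    return models[int(np.argmin(losses))]
```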
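Recurrent dynamics model (Experiment Setup row). The quoted setup describes an RNN cell of size 30 whose hidden state is built up over G = 30 history steps before predicting F = 50 steps into the future, with an output layer mapping the cell state back to the state space. The sketch below assumes a GRU cell and omits reward prediction; the exact cell type, input dimensions, and interfaces are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentDynamicsModel(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, cell_size: int = 30):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim + action_dim, cell_size)
        self.head = nn.Linear(cell_size, obs_dim)  # map cell state back to state space

    def forward(self, history_obs, history_act, future_act, G: int = 30, F: int = 50):
        # history_obs: (B, G, obs_dim), history_act: (B, G, act_dim), future_act: (B, F, act_dim)
        h = torch.zeros(history_obs.shape[0], self.cell.hidden_size)

        # Build up the hidden state over G observed history steps.
        for t in range(G):
            h = self.cell(torch.cat([history_obs[:, t], history_act[:, t]], dim=-1), h)

        # Predict F steps into the future, feeding predictions back in (reward head omitted).
        preds, obs = [], history_obs[:, G - 1]
        for t in range(F):
            h = self.cell(torch.cat([obs, future_act[:, t]], dim=-1), h)
            obs = self.head(h)
            preds.append(obs)
        return torch.stack(preds, dim=1)  # (B, F, obs_dim)
```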