Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning

Authors: Finn Rietz, Erik Schaffernicht, Stefan Heinrich, Johannes A. Stork

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints and satisfies subtask priorities during learning.
Researcher Affiliation | Academia | Finn Rietz (Örebro University, Sweden); Erik Schaffernicht (Örebro University, Sweden); Stefan Heinrich (IT University of Copenhagen, Denmark); Johannes A. Stork (Örebro University, Sweden)
Pseudocode | Yes | A pictographic overview of our method as well as pseudocode can be found in supplementary material D: Algorithm 1 (Subtask pre-training with SQL) and Algorithm 2 (Incremental PSQD subtask adaptation). A hedged sketch of the SQL pre-training target referenced by Algorithm 1 appears after this table.
Open Source Code | Yes | A GitHub repository with the implementation of the algorithm, experiment setup with hyperparameters, and documentation is available here: https://github.com/frietz58/psqd/. The repository provides the complete PSQD implementation and can be used to reproduce the results in this paper.
Open Datasets | No | The paper describes using a custom 2D navigation environment and a simulated Franka Emika Panda joint-control task based on the Gymnasium Robotics package. It does not provide access information (link, or citation with author/year) for a publicly available dataset used for training; Gymnasium Robotics is a software package, not a dataset.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It discusses pre-training, zero-shot composition, and adaptation, but without specific percentages or sample counts for data partitioning.
Hardware Specification | No | The paper describes simulated environments and control tasks but does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the "Gymnasium Robotics package" but does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation.
Experiment Setup | Yes | We normalize actions to unit length to bound the action space and penalize non-straight actions. The high-priority task r1 corresponds to obstacle avoidance and yields negative rewards in close proximity to the obstacle (see Fig. 1a): r1(s) = -σ² exp(-d² / (2l²)) if d > 0, and -β σ² exp(-d² / (2l²)) otherwise, where d is the obstacle distance (inferred from s), σ = 1 and l = 1 parameterize a squared exponential kernel, and β = 10 is an additional punishment for colliding with the obstacle. The auxiliary rewards r2 and r3 respectively yield negative rewards everywhere except in small areas at the top and at the right side of the environment: r2(s) = 0 if s.y > 7, -δ otherwise; r3(s) = 0 if s.x > 7, -δ otherwise, where we use δ = 5 in all our experiments.
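
The reward definitions in the Experiment Setup row translate directly into code. Below is a minimal Python sketch of r1, r2, and r3 as described above; the (x, y) state layout, the obstacle-distance computation, and the function names are illustrative assumptions and are not taken from the authors' repository.

```python
import numpy as np

# Hedged sketch of the 2D navigation rewards described in the Experiment Setup row.
SIGMA = 1.0   # kernel scale sigma
LENGTH = 1.0  # kernel length-scale l
BETA = 10.0   # additional collision punishment beta
DELTA = 5.0   # penalty used by the auxiliary rewards

def r1(obstacle_distance: float) -> float:
    """High-priority obstacle-avoidance reward (squared-exponential kernel)."""
    kernel = SIGMA ** 2 * np.exp(-obstacle_distance ** 2 / (2 * LENGTH ** 2))
    if obstacle_distance > 0:
        return -kernel
    return -BETA * kernel  # colliding with the obstacle: scaled punishment

def r2(y: float) -> float:
    """Auxiliary reward: zero only in a small area at the top (y > 7)."""
    return 0.0 if y > 7 else -DELTA

def r3(x: float) -> float:
    """Auxiliary reward: zero only in a small area at the right (x > 7)."""
    return 0.0 if x > 7 else -DELTA
```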
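The Pseudocode row references Algorithm 1, subtask pre-training with soft Q-learning (SQL). As a hedged illustration of the standard SQL target such pre-training typically relies on (following Haarnoja et al., 2017), the sketch below computes an importance-sampled soft value with a uniform proposal over a bounded action space; the network interface, sample count, temperature, and function names are assumptions, not the authors' implementation.

```python
import torch

def soft_value(q_net, next_obs, action_dim, alpha=1.0, n_samples=32):
    # Importance-sampled soft value:
    #   V(s') = alpha * log E_{a~U}[exp(Q(s', a) / alpha) / p(a)],
    # with a uniform proposal over [-1, 1]^action_dim (assumption).
    batch_size = next_obs.shape[0]
    actions = torch.rand(batch_size, n_samples, action_dim) * 2 - 1
    obs_rep = next_obs.unsqueeze(1).expand(-1, n_samples, -1)
    q_values = q_net(obs_rep, actions)  # assumed to return shape (batch, n_samples)
    log_uniform = -action_dim * torch.log(torch.tensor(2.0))  # log density of the proposal
    return alpha * (
        torch.logsumexp(q_values / alpha - log_uniform, dim=1)
        - torch.log(torch.tensor(float(n_samples)))
    )

def soft_bellman_target(reward, next_obs, q_target_net, action_dim, gamma=0.99, alpha=1.0):
    # One-step soft Bellman backup for subtask pre-training: r + gamma * V_soft(s').
    # Terminal-state masking is omitted for brevity.
    with torch.no_grad():
        return reward + gamma * soft_value(q_target_net, next_obs, action_dim, alpha)
```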