Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning
Authors: Finn Rietz, Erik Schaffernicht, Stefan Heinrich, Johannes A. Stork
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach by presenting successful learning, reuse, and adaptation results for both low- and high-dimensional simulated robot control tasks, as well as offline learning results. In contrast to baseline approaches, PSQD does not trade off between conflicting subtasks or priority constraints and satisfies subtask priorities during learning. |
| Researcher Affiliation | Academia | Finn Rietz, Örebro University, Sweden; Erik Schaffernicht, Örebro University, Sweden; Stefan Heinrich, IT University of Copenhagen, Denmark; Johannes A. Stork, Örebro University, Sweden |
| Pseudocode | Yes | A pictographic overview of our method as well as pseudocode can be found in supplementary material D. Algorithm 1: Subtask pre-training with SQL; Algorithm 2: Incremental PSQD subtask adaptation. (A generic sketch of lexicographic action selection from pre-trained subtask Q-functions is given after the table.) |
| Open Source Code | Yes | A GitHub repository with the implementation of the algorithm, experiment setup with hyperparameters, and documentation is available here: https://github.com/frietz58/psqd/. The repository provides the complete PSQD implementation and can be used to reproduce the results in this paper. |
| Open Datasets | No | The paper describes using a custom 2D navigation environment and a simulated Franka Emika Panda joint-control task based on the Gymnasium Robotics package. It does not provide access information (link, citation with author/year) for a publicly available dataset used for training. Gymnasium Robotics is a package, not a dataset. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It discusses pre-training, zero-shot composition, and adaptation, but without specific percentages or sample counts for data partitioning. |
| Hardware Specification | No | The paper describes simulated environments and control tasks but does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the "Gymnasium Robotics package" but does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | We normalize actions to unit length to bound the action space and penalize non-straight actions. The high-priority task r1 corresponds to obstacle avoidance and yields negative rewards in close proximity to the obstacle (see Fig. 1a): r1(s) = −σ² exp(−d²/(2l²)) if d > 0, and −β − σ² exp(−d²/(2l²)) otherwise, where d is the obstacle distance (inferred from s), σ = 1 and l = 1 parameterize a squared exponential kernel, and β = 10 is an additional punishment for colliding with the obstacle. The auxiliary rewards r2 and r3 yield negative rewards everywhere except in small areas at the top and at the right side of the environment, respectively: r2(s) = 0 if s.y > 7, −δ otherwise; r3(s) = 0 if s.x > 7, −δ otherwise, where we use δ = 5 in all our experiments. |
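
The following is a minimal Python sketch of the three reward terms quoted in the Experiment Setup row. The parameter values (σ = 1, l = 1, β = 10, δ = 5) are taken from the paper; the function names, the scalar distance argument, and the coordinate arguments are illustrative assumptions, not the authors' environment code.

```python
import numpy as np

# Parameter values as stated in the paper's experiment setup.
SIGMA, L, BETA, DELTA = 1.0, 1.0, 10.0, 5.0


def r1(d: float) -> float:
    """High-priority obstacle-avoidance reward.

    d is the distance to the obstacle (d > 0 outside the obstacle).
    The squared-exponential kernel gives increasingly negative reward as the
    agent approaches the obstacle; beta adds an extra collision penalty.
    """
    kernel = SIGMA ** 2 * np.exp(-d ** 2 / (2 * L ** 2))
    return -kernel if d > 0 else -BETA - kernel


def r2(y: float) -> float:
    """Auxiliary reward: zero only in the small area at the top (s.y > 7)."""
    return 0.0 if y > 7 else -DELTA


def r3(x: float) -> float:
    """Auxiliary reward: zero only in the small area at the right (s.x > 7)."""
    return 0.0 if x > 7 else -DELTA


if __name__ == "__main__":
    print(r1(0.5), r1(5.0))   # close to the obstacle -> strongly negative
    print(r2(8.0), r2(2.0))   # inside / outside the top goal region
    print(r3(8.0), r3(2.0))   # inside / outside the right goal region
```
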
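The Pseudocode row names the two algorithms (SQL pre-training and incremental PSQD adaptation) but does not reproduce them. As a rough, hedged illustration of the lexicographic idea of constraining lower-priority action choice to actions that are near-optimal for the higher-priority Q-function, here is a generic sketch over a discrete set of candidate actions. It is not the paper's continuous-action soft Q-learning implementation; the Q-functions `q1` and `q2`, the threshold `eps`, and the candidate-action set are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-ins for pre-trained subtask Q-functions; in the paper these
# would come from soft Q-learning pre-training, which is not reproduced here.
def q1(state, action):          # high-priority subtask (e.g. obstacle avoidance)
    return -np.linalg.norm(action - np.array([0.0, 1.0]))

def q2(state, action):          # lower-priority subtask (e.g. reach the right side)
    return float(action[0])


def lexicographic_action(q_hi, q_lo, candidate_actions, state, eps=0.1):
    """Pick the action that maximizes the low-priority Q-value among actions
    that are eps-optimal for the high-priority Q-function (its indifference set)."""
    hi_vals = np.array([q_hi(state, a) for a in candidate_actions])
    allowed = hi_vals >= hi_vals.max() - eps      # high-priority constraint
    lo_vals = np.array([q_lo(state, a) for a in candidate_actions])
    lo_vals[~allowed] = -np.inf                   # mask priority-violating actions
    return candidate_actions[int(np.argmax(lo_vals))]


if __name__ == "__main__":
    # Unit-length candidate actions, mirroring the paper's normalized action space.
    angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
    candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    print(lexicographic_action(q1, q2, candidates, state=None))
```
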