Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

Authors: Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang, Chenghu Zhou

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted, showing that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
Researcher Affiliation | Academia | Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang and Chenghu Zhou, Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1 EPO: Exterior Penalty Policy Optimization
Open Source Code | No | The paper does not contain an explicit statement offering open-source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | We train different agents and design comparison experiments in four navigation tasks based on Safety Gymnasium [Brockman et al., 2016] and four MuJoCo physical simulator tasks [Todorov et al., 2012].
Dataset Splits | No | The paper mentions 'training steps' but does not specify exact training, validation, or test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software components like 'PPO' and 'MuJoCo' environments, but it does not specify concrete version numbers for any software dependencies.
Experiment Setup | No | Algorithm 1 lists hyperparameters that need to be set (e.g., 'PPO clip rate, µ, α for penalty function and learning rate η'), but the paper does not provide the specific numerical values for these hyperparameters or other concrete details about the experimental setup in the main text.
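Although the paper does not report numeric values for these hyperparameters, the quantities named in Algorithm 1 (a PPO clip rate, penalty parameters µ and α, and a learning rate η) suggest the general shape of an exterior-penalty policy update. The sketch below is an illustrative assumption, not the authors' released implementation: it pairs a standard clipped PPO surrogate with a hand-specified polynomial exterior penalty on constraint violation, whereas EPO itself derives the penalty from a learned Penalty Metric Network. All function names and default values here are hypothetical.

```python
import torch
import torch.nn.functional as F

def epo_style_loss(logp_new, logp_old, advantages,
                   ep_cost, cost_limit,
                   clip_rate=0.2, mu=1.0, alpha=2.0):
    """Clipped PPO surrogate plus an exterior penalty on constraint violation.

    clip_rate, mu, and alpha mirror the hyperparameters named in Algorithm 1;
    their default values are placeholders, since the paper does not report them.
    """
    # Standard PPO clipped surrogate objective (to be maximised).
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_rate, 1.0 + clip_rate) * advantages,
    ).mean()

    # Exterior penalty: zero while the estimated episodic cost satisfies the
    # limit, and growing polynomially once the constraint is violated.
    violation = F.relu(ep_cost - cost_limit)
    penalty = mu * violation.pow(alpha)

    # Minimise the negative surrogate plus the penalty term.
    return -surrogate + penalty
```

In such a setup, η would be the learning rate of the policy optimizer (e.g. torch.optim.Adam(policy.parameters(), lr=eta)), while µ and α control how sharply the penalty grows once the cost limit is exceeded.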