Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CHPO: Constrained Hybrid-action Policy Optimization for Reinforcement Learning
Authors: Ao Zhou, Jiayi Guan, Li Shen, Fan Lu, Sanqing Qu, Junqiao Zhao, Ziqiao Wang, Ya Wu, Guang Chen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, extensive experiments demonstrate that the CHPO achieves competitive performance across multiple experimental tasks. Our code is available at github.CHPO. [...] Extensive comparisons and ablation experiments demonstrate that the CHPO algorithm delivers competitive performance, particularly outperforming baseline algorithms in maximizing rewards while ensuring that average costs across multiple seeds satisfy safety constraints. [...] In this section, we conduct comprehensive comparative experiments between CHPO and previous hybrid-action RL methods in tasks with different hybrid action spaces and observation dimensions. |
| Researcher Affiliation | Collaboration | Ao Zhou1,2 Jiayi Guan1 Li Shen3 Fan Lu1 Sanqing Qu1 Junqiao Zhao1 Ziqiao Wang1 Ya Wu4 Guang Chen1,2 1Tongji University 2Shanghai Innovation Institute 3Sun Yat-Sen University 4CNNC Equipment Technology Research (Shanghai) Co., Ltd. |
| Pseudocode | Yes | The pseudo-code for the CHPO algorithm is shown in Algorithm 1 of Appendix C. |
| Open Source Code | Yes | Our code is available at github.CHPO. |
| Open Datasets | Yes | To assess the performance of CHPO in various tasks with parameterized action spaces, we select three widely adopted tasks from DI-engine [60] and establish a Parking task with parameterized action spaces as experimental tasks in this work. Concretely, we choose the Moving [2, 22, 33, 60], Sliding [2, 60], and Hard Move [22, 33, 60] tasks, all of which require agents to perform both discrete and continuous actions to reach a target area. [...] [60] Yazhe Niu, Jingxin Xu, Yuan Pu, Yunpeng Nie, Jinouwen Zhang, Shuai Hu, Liangxuan Zhao, Ming Zhang, and Yu Liu. Di-engine: A universal ai system/engine for decision intelligence. https://github.com/opendilab/DI-engine, 2021. |
| Dataset Splits | No | The paper mentions 'online testing during the training process' and 'results from experiments involving 40 episodes are conducted with 3 random seeds'. These refer to the experimental runs and evaluation metrics rather than specific training/test/validation splits for a dataset itself. |
| Hardware Specification | Yes | Experiments are run on machines that consist of AMD Ryzen Threadripper 3960X cores and RTX 3090. |
| Software Dependencies | No | The paper mentions 'DI-engine [60]' as a platform used for tasks. While it's a key component, no specific version number for DI-engine or any other software library/solver is provided within the text to ensure reproducibility. |
| Experiment Setup | Yes | We provide a detailed explanation of the experimental tasks in Section D.1. Table 2 displays the parameters of the neural network model utilized in our CHPO algorithm. [...] Table 2: The hyper-parameters of the CHPO algorithm model, where s, a_d, and a_c denote the dimensions of the state, discrete action, and continuous action, respectively. The batch size is set to 320 for the Moving and Sliding tasks, and 64 for the Hard Move and Parking tasks. State-value (r): number of neurons s, 256, 128, 64, 64, 64, 1; activation function ReLU; number of networks 1; learning rate 3.00e-04; optimizer Adam. State-value (c_i): number of neurons s, 256, 128, 64, 64, 64, 1; activation function ReLU; number of networks 1; learning rate 3.00e-04; optimizer Adam. Policy (π_p): number of neurons (encoder) s, 256, 128, 64, 64; number of neurons (π_d) encoder, 64, a_d; number of neurons (π_c) encoder, 64, a_c; activation function ReLU; number of networks 2; learning rate 3.00e-04; optimizer Adam. Others: batch size 320 or 64; discount factor γ 0.99; clip ratio ϵ 0.2. |
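
For readers who want to carry the Experiment Setup values into their own scripts, the quoted Table 2 can be transcribed into a small configuration object. The following is a minimal sketch only: the class names, field names, and helper functions (`ValueNetConfig`, `PolicyConfig`, `CHPOConfig`, `batch_size_for`, `value_layer_sizes`) are hypothetical and are not taken from the authors' code; only the numeric values and task names come from the table quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ValueNetConfig:
    """State-value network settings (same for reward r and cost c_i in Table 2)."""
    hidden_sizes: Tuple[int, ...] = (256, 128, 64, 64, 64)
    output_size: int = 1
    activation: str = "relu"
    num_networks: int = 1
    learning_rate: float = 3e-4
    optimizer: str = "adam"


@dataclass(frozen=True)
class PolicyConfig:
    """Policy settings: a shared encoder feeding discrete and continuous heads."""
    encoder_sizes: Tuple[int, ...] = (256, 128, 64, 64)
    head_hidden_size: int = 64  # each head: encoder -> 64 -> action dimension
    activation: str = "relu"
    num_networks: int = 2
    learning_rate: float = 3e-4
    optimizer: str = "adam"


@dataclass(frozen=True)
class CHPOConfig:
    """Hypothetical container for the hyper-parameters quoted from Table 2."""
    reward_value: ValueNetConfig = field(default_factory=ValueNetConfig)
    cost_value: ValueNetConfig = field(default_factory=ValueNetConfig)
    policy: PolicyConfig = field(default_factory=PolicyConfig)
    discount_factor: float = 0.99
    clip_ratio: float = 0.2


# Batch size depends on the task, as stated in the Table 2 caption.
BATCH_SIZE_BY_TASK = {
    "Moving": 320,
    "Sliding": 320,
    "Hard Move": 64,
    "Parking": 64,
}


def batch_size_for(task: str) -> int:
    """Return the batch size reported for a task; raises KeyError if unknown."""
    return BATCH_SIZE_BY_TASK[task]


def value_layer_sizes(cfg: ValueNetConfig, state_dim: int) -> List[int]:
    """Full layer-width list for a value network, starting at the state dimension."""
    return [state_dim, *cfg.hidden_sizes, cfg.output_size]
```

For example, `batch_size_for("Hard Move")` looks up the per-task batch size, and `value_layer_sizes(CHPOConfig().reward_value, state_dim)` expands the layer widths once the state dimension of a task is known. The state and action dimensions themselves are not given in the quoted text, so they are left as inputs.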