Policy Optimization for Continuous Reinforcement Learning
Authors: Hanyang Zhao, Wenpin Tang, David D. Yao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through numerical experiments, we demonstrate the effectiveness and advantages of our approach. |
| Researcher Affiliation | Academia | Hanyang Zhao (Columbia University, hz2684@columbia.edu); Wenpin Tang (Columbia University, wt2319@columbia.edu); David D. Yao (Columbia University, yao@columbia.edu) |
| Pseudocode | Yes | Algorithm 1 CPG: Policy Gradient with exp(β) random rollout; Algorithm 2 CPPO: PPO with adaptive penalty constant; Algorithm 3 CPPO: PPO with adaptive penalty constant (linear KL-divergence). An illustrative sketch of the rollout and penalty ideas follows the table. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The paper sets up simulated continuous-control environments (LQ stochastic control, a 2-dimensional optimal pair-trading problem) with specified parameters, but it does not use or cite any publicly available, fixed dataset. |
| Dataset Splits | No | The paper describes policy evaluation and training steps within its continuous reinforcement learning algorithms (e.g., updating critic parameters), but it does not specify training, validation, or test splits; the experiments are continuous control tasks rather than pre-split datasets. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper describes algorithm implementations and theoretical aspects but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 1: Hyperparameter values for Example 1; Table 2: Hyperparameter values for Example 2 |
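
Since the paper provides pseudocode but no code release, the short Python sketch below illustrates the two mechanisms named in the Pseudocode row: drawing an exponentially distributed rollout horizon (the exp(β) random rollout used by CPG) and the standard adaptive KL-penalty update that CPPO-style methods build on. This is a minimal sketch under those assumptions; the function names and constants are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def sample_rollout_horizon(beta: float, rng: np.random.Generator) -> float:
    """Draw a rollout horizon T ~ Exp(beta).

    Truncating each trajectory at an exponentially distributed time with
    rate beta (the discount rate) is the "exp(beta) random rollout" idea:
    averaging over the random horizon recovers the discounted objective.
    """
    return rng.exponential(1.0 / beta)

def update_kl_penalty(penalty: float, observed_kl: float, kl_target: float,
                      tolerance: float = 1.5, scale: float = 2.0) -> float:
    """Adaptive penalty constant, as in the KL-penalty variant of PPO:

    raise the penalty when the policy moved too far (KL above target),
    lower it when the update was too conservative (KL below target).
    The tolerance and scale values here are common defaults, not the paper's.
    """
    if observed_kl > tolerance * kl_target:
        penalty *= scale
    elif observed_kl < kl_target / tolerance:
        penalty /= scale
    return penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    horizon = sample_rollout_horizon(beta=0.1, rng=rng)   # horizon ~ Exp(0.1)
    penalty = update_kl_penalty(penalty=1.0, observed_kl=0.05, kl_target=0.01)
    print(f"sampled horizon: {horizon:.2f}, new penalty: {penalty}")
```

The continuous-time versions in the paper differ in how the objective and KL term are defined; the sketch only shows the generic mechanisms that the algorithm names refer to.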