Clipped Action Policy Gradient
Authors: Yasuhiro Fujita, Shin-ichi Maeda
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CAPG generally outperforms the conventional estimator, indicating that it is a better policy gradient estimator for continuous control tasks. In this section, we evaluate the performance of CAPG compared to the conventional policy gradient estimator, which we call PG, in problems with action bounds. |
| Researcher Affiliation | Industry | Yasuhiro Fujita¹, Shin-ichi Maeda¹ (¹Preferred Networks, Inc., Japan). |
| Pseudocode | No | The paper describes algorithmic steps and equations in Section 3.3 'Implementation' but does not present them in a formally labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The source code is available at https://github.com/pfnet-research/capg. |
| Open Datasets | Yes | For our experiments, we used 10 MuJoCo-simulated environments implemented in OpenAI Gym that are widely used as benchmark tasks for deep RL algorithms (Schulman et al., 2017; Henderson et al., 2018; Ciosek & Whiteson, 2018; Gu et al., 2017b; Duan et al., 2016; Dhariwal et al., 2017). The names of the environments are listed along with their observation and action spaces in Table 1. |
| Dataset Splits | No | The paper uses MuJoCo-simulated environments for continuous control, which typically involve online data generation rather than static datasets with explicit train/validation/test splits. Therefore, no specific dataset split information (percentages or sample counts) is provided. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like Adam, PPO, TRPO, and OpenAI Gym, but does not provide specific version numbers for these or any other ancillary software. |
| Experiment Setup | Yes | The following experimental settings were used unless otherwise stated. Actions were scalars, i.e., d = 1. The parameters of a policy were initialized to zero mean and unit variance for each dimension. Each policy update used a batch of 5 (action, reward) pairs. The average reward in a batch was used as a baseline that was subtracted from each reward. Adam (Kingma & Ba, 2015) with its default hyperparameters was used to update the parameters. We followed the hyperparameter settings used in (Henderson et al., 2018), except that the learning rate of Adam used by PPO was reduced to 3e-5 for the 10-million-timestep training runs to obtain reasonable performance with PG. We used separate neural networks with two hidden layers, each with 64 hidden units and tanh nonlinearities, for both the policy and the state value function. The policy network outputs the mean of a multivariate Gaussian distribution. The main diagonal of the covariance matrix was separately parameterized as a logarithm of the standard deviation for each dimension. (Hedged sketches of this policy network and of the CAPG estimator follow the table.) |
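The Experiment Setup row describes the policy architecture used in the MuJoCo experiments: two tanh hidden layers of 64 units producing the Gaussian mean, with a state-independent log standard deviation per action dimension. The following is a minimal sketch of such a network, assuming PyTorch; it is not the authors' released code (the linked repository uses a different framework), and the class name `GaussianPolicy` is illustrative.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: 2 x 64 tanh hidden layers for the mean,
    plus a separately parameterized, state-independent log standard
    deviation for each action dimension, as described in the setup."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # One log-std parameter per action dimension (diagonal covariance).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        # Per-dimension Normal; the product over dimensions gives the
        # multivariate Gaussian with diagonal covariance.
        return torch.distributions.Normal(mean, self.log_std.exp())
```

The state value function described in the same row would use a separate network of the same two-layer tanh architecture with a scalar output head.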
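The Research Type row quotes the paper's comparison of CAPG against the conventional estimator (PG) when sampled actions are clipped to the environment's bounds. The sketch below, again assuming PyTorch and not taken from the authors' repository, illustrates the idea under those assumptions: inside the bounds the CAPG log-probability equals the ordinary Gaussian log-density, while at or beyond a bound it becomes the log of the tail probability mass that clipping maps onto that bound. The function names, the `1e-12` floor, and the loss formulation are illustrative choices.

```python
import torch
from torch.distributions import Normal


def capg_log_prob(mean, log_std, u, low, high):
    """Per-dimension log-probability used to form the CAPG-style estimator.

    u is the raw sampled action; the environment applies clip(u, low, high),
    so u <= low (resp. u >= high) identifies samples mapped to a bound."""
    dist = Normal(mean, log_std.exp())
    log_cdf_low = torch.log(dist.cdf(torch.as_tensor(low)).clamp_min(1e-12))
    log_sf_high = torch.log((1.0 - dist.cdf(torch.as_tensor(high))).clamp_min(1e-12))
    log_pdf = dist.log_prob(u)
    return torch.where(u <= low, log_cdf_low,
                       torch.where(u >= high, log_sf_high, log_pdf))


def pg_and_capg_losses(mean, log_std, u, low, high, advantage):
    """Surrogate losses whose gradients correspond to the conventional PG
    estimator and to the CAPG-style estimator, respectively."""
    dist = Normal(mean, log_std.exp())
    pg_loss = -(dist.log_prob(u).sum(-1) * advantage).mean()
    capg_loss = -(capg_log_prob(mean, log_std, u, low, high).sum(-1)
                  * advantage).mean()
    return pg_loss, capg_loss
```

For the exact estimator and its variance analysis, see the paper and the linked repository.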