Clipped Action Policy Gradient
Authors: Yasuhiro Fujita, Shin-ichi Maeda
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CAPG generally outperforms the conventional estimator, indicating that it is a better policy gradient estimator for continuous control tasks. In this section, we evaluate the performance of CAPG compared to the conventional policy gradient estimator, which we call PG, in problems with action bounds. |
| Researcher Affiliation | Industry | Yasuhiro Fujita¹, Shin-ichi Maeda¹ (¹Preferred Networks, Inc., Japan). |
| Pseudocode | No | The paper describes algorithmic steps and equations in Section 3.3 'Implementation' but does not present them in a formally labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The source code is available at https://github.com/pfnet-research/capg. |
| Open Datasets | Yes | For our experiments, we used 10 MuJoCo-simulated environments implemented in OpenAI Gym that are widely used as benchmark tasks for deep RL algorithms (Schulman et al., 2017; Henderson et al., 2018; Ciosek & Whiteson, 2018; Gu et al., 2017b; Duan et al., 2016; Dhariwal et al., 2017). The names of the environments are listed along with their observation and action spaces in Table 1. |
| Dataset Splits | No | The paper uses MuJoCo-simulated environments for continuous control, which typically involve online data generation rather than static datasets with explicit train/validation/test splits. Therefore, no specific dataset split information (percentages or sample counts) is provided. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like Adam, PPO, TRPO, and OpenAI Gym, but does not provide specific version numbers for these or any other ancillary software. |
| Experiment Setup | Yes | The following experimental settings were used unless otherwise stated. Actions were scalars, i.e., d = 1. The parameters of a policy were initialized to zero mean and unit variance for each dimension. Each policy update used a batch of 5 (action, reward) pairs. The average reward in a batch was used as a baseline that was subtracted from each reward. Adam (Kingma & Ba, 2015) with its default hyperparameters was used to update the parameters. We followed the hyperparameter settings used in (Henderson et al., 2018), except that the learning rate of Adam used by PPO was reduced to 3e-5 for the 10-million-timestep training runs to obtain reasonable performance with PG. We used separate neural networks with two hidden layers, each with 64 hidden units and tanh nonlinearities, for both the policy and the state value function. The policy network outputs the mean of a multivariate Gaussian distribution. The main diagonal of the covariance matrix was separately parameterized as a logarithm of the standard deviation for each dimension. (Hedged sketches of this policy network and of the CAPG estimator follow the table.) |
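The Experiment Setup row describes the policy architecture used in the MuJoCo experiments: two tanh hidden layers of 64 units producing the Gaussian mean, with a state-independent log standard deviation per action dimension. The following is a minimal sketch of such a network, assuming PyTorch; it is not the authors' released code (the linked repository uses a different framework), and the class name `GaussianPolicy` is illustrative.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: 2 x 64 tanh hidden layers for the mean,
    plus a separately parameterized, state-independent log standard
    deviation for each action dimension, as described in the setup."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # One log-std parameter per action dimension (diagonal covariance).
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.mean_net(obs)
        # Per-dimension Normal; the product over dimensions gives the
        # multivariate Gaussian with diagonal covariance.
        return torch.distributions.Normal(mean, self.log_std.exp())
```

The state value function described in the same row would use a separate network of the same two-layer tanh architecture with a scalar output head.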
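The Research Type row quotes the paper's comparison of CAPG against the conventional estimator (PG) when sampled actions are clipped to the environment's bounds. The sketch below, again assuming PyTorch and not taken from the authors' repository, illustrates the idea under those assumptions: inside the bounds the CAPG log-probability equals the ordinary Gaussian log-density, while at or beyond a bound it becomes the log of the tail probability mass that clipping maps onto that bound. The function names, the `1e-12` floor, and the loss formulation are illustrative choices.

```python
import torch
from torch.distributions import Normal


def capg_log_prob(mean, log_std, u, low, high):
    """Per-dimension log-probability used to form the CAPG-style estimator.

    u is the raw sampled action; the environment applies clip(u, low, high),
    so u <= low (resp. u >= high) identifies samples mapped to a bound."""
    dist = Normal(mean, log_std.exp())
    log_cdf_low = torch.log(dist.cdf(torch.as_tensor(low)).clamp_min(1e-12))
    log_sf_high = torch.log((1.0 - dist.cdf(torch.as_tensor(high))).clamp_min(1e-12))
    log_pdf = dist.log_prob(u)
    return torch.where(u <= low, log_cdf_low,
                       torch.where(u >= high, log_sf_high, log_pdf))


def pg_and_capg_losses(mean, log_std, u, low, high, advantage):
    """Surrogate losses whose gradients correspond to the conventional PG
    estimator and to the CAPG-style estimator, respectively."""
    dist = Normal(mean, log_std.exp())
    pg_loss = -(dist.log_prob(u).sum(-1) * advantage).mean()
    capg_loss = -(capg_log_prob(mean, log_std, u, low, high).sum(-1)
                  * advantage).mean()
    return pg_loss, capg_loss
```

For the exact estimator and its variance analysis, see the paper and the linked repository.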