Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

Authors: Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When evaluated on a number of continuous control tasks, Trust-PCL significantly improves the solution quality and sample efficiency of TRPO. We evaluate Trust-PCL against TRPO on a number of benchmark tasks.
Researcher Affiliation | Industry | Ofir Nachum, Mohammad Norouzi, Kelvin Xu, & Dale Schuurmans ({ofirnachum,mnorouzi,kelvinxx,schuurmans}@google.com), Google Brain
Pseudocode | Yes | A simplified pseudocode for Trust-PCL is presented in Algorithm 1.
Open Source Code | Yes | An implementation of Trust-PCL is available at https://github.com/tensorflow/models/tree/master/research/pcl_rl
Open Datasets | Yes | We chose a number of control tasks available from OpenAI Gym (Brockman et al., 2016). The first task, Acrobot, is a discrete-control task, while the remaining tasks (HalfCheetah, Swimmer, Hopper, Walker2d, and Ant) are well-known continuous-control tasks utilizing the MuJoCo environment (Todorov et al., 2012).
Dataset Splits | No | The paper describes a hyperparameter search ("For each of the variants and for each environment, we performed a hyperparameter search to find the best hyperparameters."), implying a validation process, but does not provide explicit train/validation/test dataset split percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper states "Experiments were performed using Tensorflow (Abadi et al., 2016)" but does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used for these experiments.
Software Dependencies | No | The paper mentions software such as Tensorflow (Abadi et al., 2016), OpenAI Gym (Brockman et al., 2016), MuJoCo (Todorov et al., 2012), and the Adam optimizer (Kingma & Ba, 2015), but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For TRPO we trained using batches of Q = 25,000 steps... Trust-PCL ... alternate between collecting P = 10 steps from the environment and performing a single gradient step based on a batch of size Q = 64 sub-episodes of length P from the replay buffer, with a recency weight of β = 0.001... To maintain stability we use α = 0.99 and we modified the loss from squared loss to Huber loss... For Trust-PCL (on-policy), the policy is trained by taking a single gradient step using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001. The value network update ... we perform 5 gradient steps with learning rate 0.001... We use κ = 0.95. For Trust-PCL (off-policy), both the policy and value parameters are updated in a single step using the Adam optimizer with learning rate 0.0001. We fix the discount to γ = 0.995 for all environments. For TRPO we performed a grid search over ϵ ∈ {0.01, 0.02, 0.05, 0.1}, d ∈ {10, 50}. For Trust-PCL we performed a grid search over ϵ ∈ {0.001, 0.002, 0.005, 0.01}, d ∈ {10, 50}.
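For readers attempting a reproduction, the hyperparameters and environments quoted above can be collected into a single configuration. The sketch below is a minimal summary in Python: the dictionary keys, helper name, and Gym environment IDs (in particular the version suffixes) are illustrative assumptions and are not taken from the paper or its released code.

```python
import itertools

# Hedged summary of the quoted experiment setup.
# Key names and environment ID version suffixes are assumptions for illustration.

ENVIRONMENTS = [
    "Acrobot-v1",      # discrete-control task
    "HalfCheetah-v1",  # remaining tasks are MuJoCo continuous control
    "Swimmer-v1",
    "Hopper-v1",
    "Walker2d-v1",
    "Ant-v1",
]

COMMON = {
    "discount_gamma": 0.995,  # fixed for all environments
}

TRPO = {
    "batch_size_steps": 25_000,               # Q = 25,000 environment steps per batch
    "grid_epsilon": [0.01, 0.02, 0.05, 0.1],  # searched trust-region sizes
    "grid_d": [10, 50],
}

TRUST_PCL_OFF_POLICY = {
    "collect_steps_P": 10,        # environment steps collected per update
    "batch_sub_episodes_Q": 64,   # sub-episodes of length P per gradient step
    "replay_recency_beta": 0.001,
    "alpha": 0.99,                # reported stability parameter
    "loss": "huber",              # squared loss replaced by Huber loss
    "learning_rate": 0.0001,      # single Adam step for policy and value
    "grid_epsilon": [0.001, 0.002, 0.005, 0.01],
    "grid_d": [10, 50],
}

TRUST_PCL_ON_POLICY = {
    "policy_learning_rate": 0.001,  # single Adam step for the policy
    "value_learning_rate": 0.001,   # 5 gradient steps for the value network
    "value_gradient_steps": 5,
    "kappa": 0.95,                  # κ = 0.95 as reported for this variant
}

def grid(search_space):
    """Enumerate (epsilon, d) combinations for the reported grid search."""
    return list(itertools.product(search_space["grid_epsilon"],
                                  search_space["grid_d"]))

if __name__ == "__main__":
    print("TRPO grid:", grid(TRPO))
    print("Trust-PCL grid:", grid(TRUST_PCL_OFF_POLICY))
```

With OpenAI Gym installed, the listed environments could be instantiated via gym.make(env_id), though the exact environment versions used in the paper are not stated.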