Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

Authors: Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When evaluated on a number of continuous control tasks, Trust-PCL significantly improves the solution quality and sample efficiency of TRPO. We evaluate Trust-PCL against TRPO on a number of benchmark tasks.
Researcher Affiliation | Industry | Ofir Nachum, Mohammad Norouzi, Kelvin Xu, & Dale Schuurmans ({ofirnachum,mnorouzi,kelvinxx,schuurmans}@google.com), Google Brain
Pseudocode | Yes | A simplified pseudocode for Trust-PCL is presented in Algorithm 1.
Open Source Code | Yes | An implementation of Trust-PCL is available at https://github.com/tensorflow/models/tree/master/research/pcl_rl
Open Datasets | Yes | We chose a number of control tasks available from OpenAI Gym (Brockman et al., 2016). The first task, Acrobot, is a discrete-control task, while the remaining tasks (HalfCheetah, Swimmer, Hopper, Walker2d, and Ant) are well-known continuous-control tasks utilizing the MuJoCo environment (Todorov et al., 2012).
Dataset Splits | No | The paper describes a hyperparameter search ("For each of the variants and for each environment, we performed a hyperparameter search to find the best hyperparameters."), implying a validation process, but does not provide explicit train/validation/test dataset split percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper states "Experiments were performed using Tensorflow (Abadi et al., 2016)" but does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used for these experiments.
Software Dependencies | No | The paper mentions software such as Tensorflow (Abadi et al., 2016), OpenAI Gym (Brockman et al., 2016), MuJoCo (Todorov et al., 2012), and the Adam optimizer (Kingma & Ba, 2015), but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For TRPO we trained using batches of Q = 25,000 steps... Trust-PCL ... alternate between collecting P = 10 steps from the environment and performing a single gradient step based on a batch of size Q = 64 sub-episodes of length P from the replay buffer, with a recency weight of β = 0.001... To maintain stability we use α = 0.99 and we modified the loss from squared loss to Huber loss... For Trust-PCL (on-policy), the policy is trained by taking a single gradient step using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001. The value network update ... we perform 5 gradient steps with learning rate 0.001... We use κ = 0.95. For Trust-PCL (off-policy), both the policy and value parameters are updated in a single step using the Adam optimizer with learning rate 0.0001. We fix the discount to γ = 0.995 for all environments. For TRPO we performed a grid search over ϵ ∈ {0.01, 0.02, 0.05, 0.1}, d ∈ {10, 50}. For Trust-PCL we performed a grid search over ϵ ∈ {0.001, 0.002, 0.005, 0.01}, d ∈ {10, 50}.
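For readers attempting a reproduction, the hyperparameters and environments quoted above can be collected into a single configuration. The sketch below is a minimal summary in Python: the dictionary keys, helper name, and Gym environment IDs (in particular the version suffixes) are illustrative assumptions and are not taken from the paper or its released code.

```python
import itertools

# Hedged summary of the quoted experiment setup.
# Key names and environment ID version suffixes are assumptions for illustration.

ENVIRONMENTS = [
    "Acrobot-v1",      # discrete-control task
    "HalfCheetah-v1",  # remaining tasks are MuJoCo continuous control
    "Swimmer-v1",
    "Hopper-v1",
    "Walker2d-v1",
    "Ant-v1",
]

COMMON = {
    "discount_gamma": 0.995,  # fixed for all environments
}

TRPO = {
    "batch_size_steps": 25_000,               # Q = 25,000 environment steps per batch
    "grid_epsilon": [0.01, 0.02, 0.05, 0.1],  # searched trust-region sizes
    "grid_d": [10, 50],
}

TRUST_PCL_OFF_POLICY = {
    "collect_steps_P": 10,        # environment steps collected per update
    "batch_sub_episodes_Q": 64,   # sub-episodes of length P per gradient step
    "replay_recency_beta": 0.001,
    "alpha": 0.99,                # reported stability parameter
    "loss": "huber",              # squared loss replaced by Huber loss
    "learning_rate": 0.0001,      # single Adam step for policy and value
    "grid_epsilon": [0.001, 0.002, 0.005, 0.01],
    "grid_d": [10, 50],
}

TRUST_PCL_ON_POLICY = {
    "policy_learning_rate": 0.001,  # single Adam step for the policy
    "value_learning_rate": 0.001,   # 5 gradient steps for the value network
    "value_gradient_steps": 5,
    "kappa": 0.95,                  # κ = 0.95 as reported for this variant
}

def grid(search_space):
    """Enumerate (epsilon, d) combinations for the reported grid search."""
    return list(itertools.product(search_space["grid_epsilon"],
                                  search_space["grid_d"]))

if __name__ == "__main__":
    print("TRPO grid:", grid(TRPO))
    print("Trust-PCL grid:", grid(TRUST_PCL_OFF_POLICY))
```

With OpenAI Gym installed, the listed environments could be instantiated via gym.make(env_id), though the exact environment versions used in the paper are not stated.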