Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
Authors: Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on a number of continuous control tasks, Trust-PCL significantly improves the solution quality and sample efficiency of TRPO. We evaluate Trust-PCL against TRPO on a number of benchmark tasks. |
| Researcher Affiliation | Industry | Ofir Nachum, Mohammad Norouzi, Kelvin Xu, & Dale Schuurmans ({ofirnachum,mnorouzi,kelvinxx,schuurmans}@google.com), Google Brain |
| Pseudocode | Yes | A simplified pseudocode for Trust-PCL is presented in Algorithm 1. |
| Open Source Code | Yes | An implementation of Trust-PCL is available at https://github.com/tensorflow/models/tree/master/research/pcl_rl |
| Open Datasets | Yes | We chose a number of control tasks available from OpenAI Gym (Brockman et al., 2016). The first task, Acrobot, is a discrete-control task, while the remaining tasks (HalfCheetah, Swimmer, Hopper, Walker2d, and Ant) are well-known continuous-control tasks utilizing the MuJoCo environment (Todorov et al., 2012). |
| Dataset Splits | No | The paper describes a hyperparameter search ('For each of the variants and for each environment, we performed a hyperparameter search to find the best hyperparameters.'), which implies some form of model selection, but it does not report explicit train/validation/test split percentages, sample counts, or a reference to a predefined split. |
| Hardware Specification | No | The paper states 'Experiments were performed using Tensorflow (Abadi et al., 2016)' but does not provide specific details about the hardware (e.g., CPU/GPU models, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions software like TensorFlow (Abadi et al., 2016), OpenAI Gym (Brockman et al., 2016), MuJoCo (Todorov et al., 2012), and the Adam optimizer (Kingma & Ba, 2015), but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For TRPO we trained using batches of Q = 25,000 steps... Trust-PCL ... alternate between collecting P = 10 steps from the environment and performing a single gradient step based on a batch of size Q = 64 sub-episodes of length P from the replay buffer, with a recency weight of β = 0.001... To maintain stability we use α = 0.99 and we modified the loss from squared loss to Huber loss... For Trust-PCL (on-policy), the policy is trained by taking a single gradient step using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001. The value network update ... we perform 5 gradient steps with learning rate 0.001... We use κ = 0.95. For Trust-PCL (off-policy), both the policy and value parameters are updated in a single step using the Adam optimizer with learning rate 0.0001. We fix the discount to γ = 0.995 for all environments. For TRPO we performed a grid search over ϵ ∈ {0.01, 0.02, 0.05, 0.1}, d ∈ {10, 50}. For Trust-PCL we performed a grid search over ϵ ∈ {0.001, 0.002, 0.005, 0.01}, d ∈ {10, 50}. |
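
To make the quoted setup easier to scan, here is a minimal sketch that collects the reported hyperparameters into Python configuration dictionaries. The dictionary names, keys, and `-v1` environment-ID suffixes are assumptions made for illustration, not identifiers from the paper or the released pcl_rl code; only the numerical values quoted in the Experiment Setup row are used.

```python
# Hypothetical consolidation of the hyperparameters quoted in the
# "Experiment Setup" row above; dictionary names and keys are illustrative
# and do not come from the paper or the released pcl_rl code.

ENVIRONMENTS = [
    "Acrobot-v1",                                 # discrete-control task
    "HalfCheetah-v1", "Swimmer-v1", "Hopper-v1",  # MuJoCo continuous control
    "Walker2d-v1", "Ant-v1",
]

DISCOUNT_GAMMA = 0.995  # "We fix the discount to gamma = 0.995 for all environments."

TRPO_CONFIG = {
    "batch_steps": 25_000,                    # Q: environment steps per batch
    "epsilon_grid": [0.01, 0.02, 0.05, 0.1],  # grid-searched trust-region size
    "d_grid": [10, 50],
}

TRUST_PCL_COMMON = {                 # settings quoted before the on/off-policy split
    "rollout_steps": 10,             # P: steps collected between updates
    "batch_subepisodes": 64,         # Q: sub-episodes of length P per gradient step
    "replay_recency_weight": 0.001,  # beta on the replay-buffer sampling distribution
    "target_alpha": 0.99,            # "To maintain stability we use alpha = 0.99"
    "value_loss": "huber",           # squared loss replaced with Huber loss
    "epsilon_grid": [0.001, 0.002, 0.005, 0.01],
    "d_grid": [10, 50],
}

TRUST_PCL_ON_POLICY = {
    "optimizer": "adam",
    "policy_learning_rate": 1e-3,    # single gradient step per update
    "value_learning_rate": 1e-3,     # applied over 5 gradient steps
    "value_gradient_steps": 5,
    "kappa": 0.95,
}

TRUST_PCL_OFF_POLICY = {
    "optimizer": "adam",
    "learning_rate": 1e-4,           # single joint step for policy and value
}
```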