An Inference-Based Policy Gradient Method for Learning Options
Authors: Matthew J. A. Smith, Herke van Hoof, Joelle Pineau
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to evaluate the effectiveness of our algorithm, as well as the qualitative attributes of the options learned, we examine its performance across several standardized continuous control environments as implemented in the OpenAI Gym (Brockman et al., 2016) in the MuJoCo physics simulator (Todorov et al., 2012). In particular, we examine the following environments: Hopper-v1 (observation dimension: 11, action dimension: 3); Walker2d-v1 (observation dimension: 17, action dimension: 6); HalfCheetah-v1 (observation dimension: 17, action dimension: 6); Swimmer-v1 (observation dimension: 8, action dimension: 2). Generally, they all require the agent to learn to operate joint motors in order to move the agent in a particular direction, with penalties for unnecessary actions. Together, they are considered to be reasonable benchmarks for state-of-the-art continuous RL algorithms. 6.1. Comparison of performance: We compared the performance of our algorithm (IOPG) with results from option-critic (OC) and asynchronous advantage actor-critic (A3C) methods, as described in Mnih et al. (2016). The results of these experiments are shown in Fig. 2. |
| Researcher Affiliation | Academia | Matthew J. A. Smith (1), Herke van Hoof (2), Joelle Pineau (1). (1) Department of Computer Science, McGill University, Quebec, Canada; (2) Informatics Institute, University of Amsterdam, The Netherlands. Correspondence to: Matthew J. A. Smith <matthew.smith5@mail.mcgill.ca>. |
| Pseudocode | Yes | The algorithm for learning options to optimize returns through a series of interactions with the environment is given in Algorithm 1. While this algorithm can only be applied in the episodic RL setup, it is also possible to employ the technical insight shown here in an online manner, which is the topic of the next section. Algorithm 1: Inferred Option Policy Gradient (IOPG): initialize parameters randomly; for each episode: ω0 ∼ πΩ(ω\|s0) (sample initial option); for t = 0, …, T: at ∼ πωt(st) (sample action from option); get st+1 and rt from the system; ωt+1 ∼ πΩ(ωt+1\|ωt, st+1) (sample option); end; update ν according to (4) and θ, ϑ, and ξ according to (3), using the sampled data. Algorithm 2: Inferred Option Actor-Critic (IOAC): initialize ψ randomly; for each episode: gωψ ← 0; s ← s0; ω ∼ πΩ(s); for t = 0, …, T: a ∼ πω(a\|s); s′, r ← step(a, s); update ν according to TD; update gωψ according to (5); substitute gωψ into (3) to update θ and ϑ; draw option termination b ∼ βω(s′); if b then ω ∼ πΩ(s′); s ← s′; end. (A Python sketch of the Algorithm 1 loop appears after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | In order to evaluate the effectiveness of our algorithm, as well as the qualitative attributes of the options learned, we examine its performance across several standardized continuous control environments as implemented in the OpenAI Gym (Brockman et al., 2016) in the MuJoCo physics simulator (Todorov et al., 2012). In particular, we examine the following environments: Hopper-v1 (observation dimension: 11, action dimension: 3); Walker2d-v1 (observation dimension: 17, action dimension: 6); HalfCheetah-v1 (observation dimension: 17, action dimension: 6); Swimmer-v1 (observation dimension: 8, action dimension: 2). (A sketch of instantiating these environments appears after the table.) |
| Dataset Splits | No | The paper mentions using OpenAI Gym environments and comparing performance but does not specify explicit train/validation/test splits, only that it is an 'episodic setup'. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments, only mentioning 'multiple agents operating in parallel' and 'asynchronous threads'. |
| Software Dependencies | No | The paper mentions 'OpenAI Gym (Brockman et al., 2016)', the 'MuJoCo physics simulator (Todorov et al., 2012)', and 'RMSProp (Tieleman & Hinton, 2012)', but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | Our model architecture for all three algorithms closely follows that of Schulman et al. (2017). The policies and value functions were represented using separate feed-forward neural networks, with no parameters shared. For each agent, both the value function and the policies used two hidden layers of 64 units with tanh activation functions. The IOPG and OC methods shared these parameters across all policy and termination networks. The option sub-policies and A3C policies were implemented as linear layers on top of this, representing the mean of a Gaussian distribution. The variance of the policy was parametrized by a linear softplus layer. Option termination was given by a linear sigmoid layer for each option. The policy over options, for OC and IOPG methods, was represented using a final linear softmax layer of size equal to the number of options available. The value function for IOPG and AC methods was represented using a final linear layer of size 1, and for OC, of size \|Ω\|. All weight matrices were initialized to have normalized rows. RMSProp (Tieleman & Hinton, 2012) was used to optimize parameters for all agents. We employ a single shared set of RMSProp parameters across all asynchronous threads. Additionally, entropy regularization was used during optimization for the AC policies, the option policies, and the policies over options. This regularization encourages exploration and prevents the policies from converging to single repeated actions, a failure mode common to policy gradient methods parametrized by neural networks (Mnih et al., 2016). (An illustrative sketch of this architecture appears after the table.) |
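
The Pseudocode row gives the control flow of Algorithm 1 (IOPG). The minimal Python sketch below mirrors only that loop structure; the environment, the policy over options πΩ, the intra-option policies πω, and the parameter updates are placeholder stubs (no code is released with the paper), so all function names here are illustrative.

```python
# Structural sketch of the IOPG sampling loop (Algorithm 1).
# Only the control flow follows the pseudocode; policies, dynamics,
# and the updates of nu, theta, vartheta, xi are placeholders.
import numpy as np

rng = np.random.default_rng(0)
N_OPTIONS, OBS_DIM, ACT_DIM, HORIZON = 4, 11, 3, 200  # Hopper-v1-like sizes

def sample_option(prev_option, state):
    # Stand-in for the policy over options pi_Omega(omega | omega_prev, state).
    return int(rng.integers(N_OPTIONS))

def sample_action(option, state):
    # Stand-in for the intra-option policy pi_omega(a | state).
    return rng.normal(size=ACT_DIM)

def env_reset():
    return rng.normal(size=OBS_DIM)

def env_step(state, action):
    # Stand-in dynamics: returns next state, reward, done flag.
    return rng.normal(size=OBS_DIM), float(-np.sum(action ** 2)), False

for episode in range(3):
    s = env_reset()
    w = sample_option(None, s)            # omega_0 ~ pi_Omega(omega | s_0)
    trajectory = []
    for t in range(HORIZON):
        a = sample_action(w, s)           # a_t ~ pi_omega_t(s_t)
        s_next, r, done = env_step(s, a)  # get s_{t+1} and r_t from the system
        w = sample_option(w, s_next)      # omega_{t+1} ~ pi_Omega(. | omega_t, s_{t+1})
        trajectory.append((s, w, a, r))
        s = s_next
        if done:
            break
    # The paper then updates the critic nu via its Eq. (4) and the policy,
    # sub-policy, and termination parameters via Eq. (3) from this trajectory;
    # those updates are omitted here.
```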
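The four benchmark environments quoted in the Research Type and Open Datasets rows can be instantiated through the standard Gym API, as in the sketch below. The "-v1" identifiers match the Gym version used in the paper; running this requires a MuJoCo installation, and newer Gym/Gymnasium releases ship these tasks under higher version suffixes.

```python
# Sketch: instantiate the four MuJoCo benchmarks named in the paper and
# print their observation/action dimensions (requires Gym with MuJoCo).
import gym

ENV_IDS = ["Hopper-v1", "Walker2d-v1", "HalfCheetah-v1", "Swimmer-v1"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    print(env_id,
          "observation dim:", env.observation_space.shape[0],
          "action dim:", env.action_space.shape[0])
    env.close()
```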
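The Experiment Setup row describes the architecture in enough detail for a sketch. The PyTorch module below is one possible reading of that description: a shared two-layer tanh trunk of 64 units, per-option linear heads for the Gaussian mean, a linear softplus head for the variance, a linear sigmoid head per option for termination, and a final linear softmax layer over options, with a separate value network. It is illustrative only (the authors' code is not released), and the head layout, names, and dimensions are assumptions.

```python
# Illustrative PyTorch sketch of the architecture described in the paper's
# experiment setup; layer grouping and naming are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OptionPolicyNet(nn.Module):
    def __init__(self, obs_dim, act_dim, n_options):
        super().__init__()
        self.n_options, self.act_dim = n_options, act_dim
        # Shared trunk: two hidden layers of 64 units with tanh activations.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        # Per-option Gaussian mean (linear head on top of the trunk).
        self.mean_head = nn.Linear(64, n_options * act_dim)
        # Pre-activation for the softplus-parametrized variance.
        self.var_head = nn.Linear(64, n_options * act_dim)
        # Per-option termination probability (linear + sigmoid).
        self.term_head = nn.Linear(64, n_options)
        # Policy over options (linear + softmax of size |Omega|).
        self.option_head = nn.Linear(64, n_options)

    def forward(self, obs):
        h = self.trunk(obs)
        mean = self.mean_head(h).view(-1, self.n_options, self.act_dim)
        var = F.softplus(self.var_head(h)).view(-1, self.n_options, self.act_dim)
        beta = torch.sigmoid(self.term_head(h))            # termination probabilities
        pi_omega = F.softmax(self.option_head(h), dim=-1)  # policy over options
        return mean, var, beta, pi_omega

class ValueNet(nn.Module):
    # Separate value network: output size 1 for IOPG/A3C, |Omega| for OC.
    def __init__(self, obs_dim, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, out_dim),
        )

    def forward(self, obs):
        return self.net(obs)

# Example with the Hopper-v1 dimensions reported in the paper (obs 11, act 3).
policy = OptionPolicyNet(obs_dim=11, act_dim=3, n_options=4)
value = ValueNet(obs_dim=11)
mean, var, beta, pi_omega = policy(torch.zeros(1, 11))
```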