Efficient Reinforcement Learning with Hierarchies of Machines by Leveraging Internal Transitions
Authors: Aijun Bai, Stuart Russell
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the benchmark Taxi domain [Dietterich, 1999] and a much more complex RoboCup Keepaway domain [Stone et al., 2005]. |
| Researcher Affiliation | Academia | Aijun Bai, UC Berkeley, aijunbai@berkeley.edu; Stuart Russell, UC Berkeley, russell@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 gives the pseudo-code for running a HAM, where the Execute function executes an action in the environment and returns the next environment state, and the Choose function picks the next machine state given the updated stack z, the current environment state s... and Algorithm 3 gives the pseudo-code of the HAMQ-INT algorithm. (A sketch of this execution loop follows the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We conduct experiments on the benchmark Taxi domain [Dietterich, 1999] and a much more complex RoboCup Keepaway domain [Stone et al., 2005]. |
| Dataset Splits | No | The paper mentions general learning parameters like learning rate and exploration policy, but does not specify dataset splits (e.g., train/validation/test percentages or sample counts) or cross-validation details. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as exact GPU/CPU models, memory amounts, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions using the "SARSA learning rule with a linear function approximator" and refers to "ALisp", but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For all learning algorithms, the learning rate is set to be 0.125; an ϵ-greedy policy which selects a random action with probability 0.01 is used to balance between exploration and exploitation. (These reported hyperparameters are wired together in the second sketch below.) |
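
The Pseudocode row summarizes Algorithm 1, which runs a HAM by manipulating a run-time stack of machine states and calling Execute for primitive actions and Choose at choice points. Below is a minimal sketch of that execution loop. The `MachineState`/`Machine` types, the `env.step` stand-in for Execute, and the random `choose` are hypothetical reconstructions from the description above, not the authors' code; HAMQ-INT learns the choice with Q-learning rather than picking randomly.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import random

@dataclass
class MachineState:
    kind: str                                  # "action" | "call" | "choice" | "stop"
    action: Optional[object] = None            # primitive action (kind == "action")
    callee: Optional["Machine"] = None         # sub-machine to run (kind == "call")
    choices: List["MachineState"] = field(default_factory=list)  # kind == "choice"
    next_state: Optional["MachineState"] = None

@dataclass
class Machine:
    start: MachineState

def choose(z, s, choices):
    # Stand-in for the paper's Choose function: a random pick here, whereas
    # HAMQ learns a Q-function over (stack z, environment state s) choice points.
    return random.choice(choices)

def run_ham(env, root: Machine, s):
    """Run a HAM from its root machine; `env.step(a)` is assumed to play the
    role of the paper's Execute function and return the next environment state."""
    z = [root.start]                  # run-time stack of machine states
    while z:
        m = z[-1]
        if m.kind == "action":        # primitive action: act in the environment
            s = env.step(m.action)    # Execute(action) -> next environment state
            z[-1] = m.next_state
        elif m.kind == "call":        # advance the caller, push the callee's start
            z[-1] = m.next_state
            z.append(m.callee.start)
        elif m.kind == "choice":      # choice point resolved by Choose
            z[-1] = choose(z, s, m.choices)
        else:                         # "stop": return control to the caller
            z.pop()
    return s
```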
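The Experiment Setup and Software Dependencies rows quote a SARSA learning rule with a linear function approximator, a learning rate of 0.125, and an ϵ-greedy policy with ϵ = 0.01. The sketch below wires those reported hyperparameters together; the feature vectors, action indexing, and the discount factor are assumptions, since the paper's implementation is not released.

```python
import numpy as np

ALPHA = 0.125     # learning rate reported in the paper
EPSILON = 0.01    # exploration probability reported in the paper
GAMMA = 1.0       # assumed discount; not stated in the rows above

def q_value(w, phi):
    """Linear function approximation: Q(s, a) = w . phi(s, a)."""
    return float(np.dot(w, phi))

def epsilon_greedy(w, features_per_action, rng):
    """Pick a random action with probability EPSILON, else the greedy one.
    `features_per_action[a]` is the (hypothetical) feature vector phi(s, a)."""
    if rng.random() < EPSILON:
        return int(rng.integers(len(features_per_action)))
    qs = [q_value(w, phi) for phi in features_per_action]
    return int(np.argmax(qs))

def sarsa_update(w, phi, r, phi_next, done):
    """One on-policy SARSA step on the linear weights."""
    target = r if done else r + GAMMA * q_value(w, phi_next)
    td_error = target - q_value(w, phi)
    return w + ALPHA * td_error * phi
```

As a usage note, `rng = np.random.default_rng(0)` gives a reproducible generator for `epsilon_greedy`, and `sarsa_update` is applied once per transition with the feature vectors of the action actually taken, which is what makes the rule on-policy.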