Phase-Parametric Policies for Reinforcement Learning in Cyclic Environments

Authors: Arjun Sharma, Kris Kitani

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that the proposed approach has superior modeling performance compared to traditional function approximators in cyclic environments. Experiments on both discrete and continuous state-action spaces show that the proposed Phase-Parametric Networks outperform traditional networks that take phase as an additional input for reinforcement learning problems in cyclic environments.
Researcher Affiliation | Academia | Arjun Sharma, Kris M. Kitani, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, {arjuns2, kkitani}@cs.cmu.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link indicating open-source code availability for the described methodology.
Open Datasets | No | The paper describes custom grid-world problems ('Freeway On-Ramp Problem', 'Flying with Wind Problem') and uses the MuJoCo physics simulator for continuous control tasks ('Hopper', 'Walker') rather than providing access information for a publicly available dataset.
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages or counts) for training, validation, or test sets, nor does it refer to predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions the 'MuJoCo physics simulator' and 'Adam optimization' but does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | Network configuration: for both discrete problems, all networks had 2 hidden layers with 8 units in each layer and a linear output layer with 5 units (one for each action). Rectified Linear Units (ReLU) were used after every layer of the feed-forward architectures, except the output layer. The recurrent architectures used Gated Recurrent Units for the hidden layer. All networks were trained for 50,000 iterations using Adam optimization (Kingma and Ba 2014) with an initial learning rate of 10^-4. The baseline DDPG and Phase DDPG networks for the continuous problems followed the settings outlined in Lillicrap et al. (2015). All networks had two hidden layers with 400 and 300 units and ReLU non-linearity. Actions were appended to the output of the first hidden layer for the critic networks. The networks were optimized using Adam with learning rates of 10^-4 and 10^-3 for the actor and critic, respectively. A replay memory of size 10^6 was used.
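
The following is a minimal sketch of the network configurations quoted above, assuming PyTorch as the framework (the paper does not name one). Class names, the example state/action dimensions, and the actor's tanh output are illustrative assumptions, not details taken from the authors' code.

```python
# Minimal sketch of the reported network configurations, assuming PyTorch.
import torch
import torch.nn as nn

# Discrete problems: 2 hidden layers of 8 units each, ReLU after each hidden
# layer, and a linear output layer with 5 units (one per action).
class DiscreteQNet(nn.Module):
    def __init__(self, state_dim, n_actions=5, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # linear output, no activation
        )
        # The recurrent variants reportedly use Gated Recurrent Units for the
        # hidden layer; that variant is not sketched here.

    def forward(self, state):
        return self.net(state)

# Continuous problems (DDPG baseline following Lillicrap et al. 2015):
# two hidden layers of 400 and 300 units with ReLU; the critic appends the
# action to the output of its first hidden layer.
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # bounded actions (assumption)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)  # action joins after fc1
        self.q = nn.Linear(300, 1)

    def forward(self, state, action):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))
        return self.q(h)

# Optimizers as reported: Adam with learning rate 1e-4 for the discrete
# networks and the actor, and 1e-3 for the critic.
q_net = DiscreteQNet(state_dim=4)              # illustrative dimensions
actor, critic = Actor(11, 3), Critic(11, 3)    # Hopper-like sizes, illustrative
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```

A replay buffer of capacity 10^6 transitions, as in standard DDPG, would accompany these networks during training; it is omitted from the sketch for brevity.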