Learning the Arrow of Time for Problems in Reinforcement Learning

Authors: Nasim Rahaman, Steffen Wolf, Anirudh Goyal, Roman Remme, Yoshua Bengio

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well-known notion of an arrow of time due to Jordan, Kinderlehrer, and Otto (1998).
Researcher Affiliation | Academia | 1 Image Analysis and Learning Lab, Ruprecht-Karls-Universität Heidelberg; 2 Max-Planck Institute for Intelligent Systems, Tübingen; 3 Mila, Montréal; 4 CIFAR Senior Fellow; 5 Canada CIFAR AI Chair
Pseudocode | Yes | The training algorithm is rather straightforward and can be summarized as follows (please refer to App. B for the full algorithm). Appendix B is titled 'ALGORITHM' and contains 'Algorithm 1 Training the h-Potential'. (A hedged sketch of such a training loop is given after this table.)
Open Source Code | No | The paper does not provide explicit links or statements about releasing the source code for its proposed methodology.
Open Datasets | Yes | The environment considered is a 7x7 2D world, where cells can be occupied by the agent, the goal and/or a vase (their respective positions are randomly sampled in each episode) (Vaseworld); Sokoban (warehouse-keeper) (Schrader, 2018); Mountain-Car (Sutton & Barto, 2011); Under-damped Pendulum (Brockman et al., 2016). (See the environment-instantiation sketch after this table.)
Dataset Splits | No | The paper describes general training procedures and batch sizes, but does not specify a separate validation dataset split with exact percentages or sample counts for hyperparameter tuning or early stopping.
Hardware Specification | Yes | All experiments were run on a workstation with 40 cores, 256 GB RAM and 2 NVIDIA GTX 1080Ti GPUs.
Software Dependencies | No | The paper mentions adapting implementations from 'Shangtong (2018)' and 'OpenAI Gym (Brockman et al., 2016)', but does not specify version numbers for these or other software libraries/frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The policy is parameterized by a 3-layer deep, 256-unit wide (fully connected) ReLU network and trained via Duelling Double Deep Q-Learning (Van Hasselt et al., 2016; Wang et al., 2015). The discount factor is set to 0.99 and the target network is updated once every 200 iterations. For exploration, we use an ϵ-greedy policy, where ϵ is decayed linearly from 1 to 0.1 in the span of the first 10000 iterations. The replay buffer stores 10000 experiences and the batch-size used is 10. (A configuration sketch collecting these hyperparameters follows this table.)
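
The paper's Algorithm 1 ("Training the h-Potential") is not reproduced in this report, but the general idea of learning a potential that increases along observed transitions can be illustrated. The following is a minimal sketch, assuming the objective maximizes the expected increase of h over sampled transitions while an L2 penalty keeps h bounded; the network shape, regularizer, and hyperparameters are illustrative assumptions, not the authors' exact Algorithm 1.

```python
# Hypothetical sketch of training an h-potential on observed transitions.
# Assumes the objective rewards increases of h along the arrow of time,
# regularized by an L2 penalty on h's magnitude; see Appendix B of the
# paper for the exact procedure. All names and values are illustrative.
import torch
import torch.nn as nn


class HPotential(nn.Module):
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)  # scalar potential per state


def train_h(h, transition_batches, epochs=10, lr=1e-3, reg_weight=1e-2):
    """transition_batches: list of (s_t, s_tp1) tensor batches from rollouts."""
    opt = torch.optim.Adam(h.parameters(), lr=lr)
    for _ in range(epochs):
        for s_t, s_tp1 in transition_batches:
            gain = (h(s_tp1) - h(s_t)).mean()                    # increase along time
            reg = (h(s_t) ** 2).mean() + (h(s_tp1) ** 2).mean()  # keep h bounded
            loss = -gain + reg_weight * reg                      # maximize gain, regularize
            opt.zero_grad()
            loss.backward()
            opt.step()
    return h
```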
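
For the publicly available environments listed under Open Datasets, the sketch below shows how they could be instantiated. The environment IDs are assumptions based on the standard OpenAI Gym and gym-sokoban registries of that era (pre-0.26 Gym API); the 7x7 Vaseworld is the authors' custom environment and has no public ID.

```python
# Illustrative instantiation of the public environments via OpenAI Gym
# (Brockman et al., 2016) and gym-sokoban (Schrader, 2018). The IDs below
# are the standard registry names and are assumptions; Vaseworld is custom
# to the paper and is not included here.
import gym
import gym_sokoban  # noqa: F401  (importing registers the Sokoban-* envs)

env_ids = ["MountainCar-v0", "Pendulum-v0", "Sokoban-v0"]

for env_id in env_ids:
    env = gym.make(env_id)
    obs = env.reset()                                        # old Gym API: returns obs only
    obs, reward, done, info = env.step(env.action_space.sample())
    print(env_id, reward, done)
    env.close()
```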
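
The Experiment Setup row quotes concrete hyperparameters; the sketch below collects them into a single configuration and pairs them with a standard dueling Q-network of the stated width and depth. The exact split of the authors' 3-layer, 256-unit network into value and advantage streams is an assumption, and all identifiers are illustrative.

```python
# A minimal sketch of a dueling Q-network and the hyperparameters quoted
# in the Experiment Setup row. How the authors arrange the 3-layer,
# 256-unit trunk and the value/advantage heads is an assumption.
import torch.nn as nn


class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)               # state-value stream
        self.advantage = nn.Linear(hidden, n_actions)   # advantage stream

    def forward(self, s):
        z = self.trunk(s)
        v, a = self.value(z), self.advantage(z)
        return v + a - a.mean(dim=-1, keepdim=True)     # dueling aggregation


# Hyperparameters as reported in the paper's experiment setup.
CONFIG = dict(
    gamma=0.99,               # discount factor
    target_update_every=200,  # iterations between target-network syncs
    eps_start=1.0,            # epsilon-greedy schedule: 1.0 -> 0.1
    eps_end=0.1,
    eps_decay_iters=10_000,   # linear decay over the first 10000 iterations
    replay_buffer_size=10_000,
    batch_size=10,
)
```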