Learning the Arrow of Time for Problems in Reinforcement Learning
Authors: Nasim Rahaman, Steffen Wolf, Anirudh Goyal, Roman Remme, Yoshua Bengio
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well known notion of an arrow of time due to Jordan, Kinderlehrer, and Otto (1998). |
| Researcher Affiliation | Academia | 1Image Analysis and Learning Lab, Ruprecht-Karls-Universität Heidelberg 2Max-Planck Institute for Intelligent Systems, Tübingen 3Mila, Montréal 4CIFAR Senior Fellow 5Canada CIFAR AI Chair |
| Pseudocode | Yes | The training algorithm is rather straightforward and can be summarized as follows (please refer to App B for the full algorithm). Appendix B is titled 'ALGORITHM' and contains 'Algorithm 1 Training the h-Potential'. |
| Open Source Code | No | The paper does not provide explicit links or statements about releasing the source code for their proposed methodology. |
| Open Datasets | Yes | The environment considered is a 7x7 2D world, where cells can be occupied by the agent, the goal and/or a vase (their respective positions are randomly sampled in each episode). (Vaseworld), Sokoban (warehouse-keeper) (Schrader, 2018), Mountain-Car (Sutton & Barto, 2011), Under-damped Pendulum (Brockman et al., 2016). A minimal instantiation sketch for the publicly available environments follows the table. |
| Dataset Splits | No | The paper describes general training procedures and batch sizes, but does not specify a separate validation dataset split with exact percentages or sample counts for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | All experiments were run on a workstation with 40 cores, 256 GB RAM and 2 Nvidia GTX 1080Ti. |
| Software Dependencies | No | The paper mentions adapting implementations from 'Shangtong (2018)' and 'OpenAI Gym (Brockman et al., 2016)', but does not specify version numbers for these or other software libraries/frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | The policy is parameterized by a 3-layer deep, 256-unit wide (fully connected) ReLU network and trained via Duelling Double Deep Q-Learning (Van Hasselt et al., 2016; Wang et al., 2015). The discount factor is set to 0.99 and the target network is updated once every 200 iterations. For exploration, we use an ϵ-greedy policy, where ϵ is decayed linearly from 1 to 0.1 in the span of the first 10000 iterations. The replay buffer stores 10000 experiences and the batch-size used is 10. A hedged configuration sketch based on this description also follows the table. |
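
The publicly released environments named in the Open Datasets row map onto standard Gym registrations. The sketch below is an illustration only, assuming the usual environment IDs ("MountainCar-v0", "Pendulum-v0", and "Sokoban-v0" from the gym-sokoban package); Vaseworld is the authors' own construction with no public registration and is omitted.

```python
# Minimal sketch (not from the paper): instantiating the publicly available
# environments referenced above, assuming the standard Gym / gym-sokoban IDs.
import gym
import gym_sokoban  # noqa: F401 -- importing registers the Sokoban environments (Schrader, 2018)

for env_id in ["MountainCar-v0", "Pendulum-v0", "Sokoban-v0"]:
    env = gym.make(env_id)
    observation = env.reset()
    print(env_id, env.observation_space, env.action_space)
    env.close()
```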
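
For the Experiment Setup row, the following is a minimal sketch of the reported configuration: a 3-layer, 256-unit fully connected ReLU network with a dueling head and a Double-DQN bootstrap target, together with the stated hyperparameters (discount 0.99, target update every 200 iterations, linear ϵ decay from 1 to 0.1 over the first 10000 iterations, replay capacity 10000, batch size 10). The dueling head layout, optimizer choice, and state/action dimensions are assumptions for illustration; this is not the authors' released code.

```python
# Hedged sketch of the reported policy/training configuration. Network depth/width,
# discount, target-update period, epsilon schedule, buffer size and batch size are
# taken from the paper's description; the dueling head composition and everything
# else here is an assumption for illustration.
import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, width: int = 256):
        super().__init__()
        # Shared 3-layer, 256-unit ReLU trunk, as described in the paper.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Dueling decomposition into state-value and advantage streams (assumed layout).
        self.value_head = nn.Linear(width, 1)
        self.advantage_head = nn.Linear(width, num_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.trunk(state)
        value = self.value_head(features)
        advantage = self.advantage_head(features)
        # Standard dueling aggregation: Q = V + (A - mean(A)).
        return value + advantage - advantage.mean(dim=-1, keepdim=True)


# Hyperparameters reported in the paper.
GAMMA = 0.99                  # discount factor
TARGET_UPDATE_EVERY = 200     # target-network sync period (iterations)
REPLAY_CAPACITY = 10_000      # replay buffer size
BATCH_SIZE = 10               # minibatch size
EPSILON_START, EPSILON_END = 1.0, 0.1
EPSILON_DECAY_ITERS = 10_000  # linear decay span


def epsilon_at(iteration: int) -> float:
    """Linearly decay epsilon from 1.0 to 0.1 over the first 10,000 iterations."""
    frac = min(iteration / EPSILON_DECAY_ITERS, 1.0)
    return EPSILON_START + frac * (EPSILON_END - EPSILON_START)


def double_q_targets(online: DuelingQNetwork, target: DuelingQNetwork,
                     rewards: torch.Tensor, next_states: torch.Tensor,
                     dones: torch.Tensor) -> torch.Tensor:
    """Double-DQN bootstrap: actions chosen by the online net, evaluated by the target net."""
    with torch.no_grad():
        next_actions = online(next_states).argmax(dim=-1, keepdim=True)
        next_q = target(next_states).gather(-1, next_actions).squeeze(-1)
        return rewards + GAMMA * (1.0 - dones) * next_q
```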