Learning the Arrow of Time for Problems in Reinforcement Learning

Authors: Nasim Rahaman, Steffen Wolf, Anirudh Goyal, Roman Remme, Yoshua Bengio

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well-known notion of an arrow of time due to Jordan, Kinderlehrer, and Otto (1998).
Researcher Affiliation | Academia | 1 Image Analysis and Learning Lab, Ruprecht-Karls-Universität Heidelberg; 2 Max-Planck Institute for Intelligent Systems, Tübingen; 3 Mila, Montréal; 4 CIFAR Senior Fellow; 5 Canada CIFAR AI Chair
Pseudocode | Yes | The training algorithm is rather straightforward and can be summarized as follows (please refer to App. B for the full algorithm). Appendix B is titled 'ALGORITHM' and contains 'Algorithm 1 Training the h-Potential'. (A hedged sketch of such a training loop is given after this table.)
Open Source Code | No | The paper does not provide explicit links or statements about releasing the source code for its proposed methodology.
Open Datasets | Yes | The environment considered is a 7x7 2D world, where cells can be occupied by the agent, the goal and/or a vase (their respective positions are randomly sampled in each episode) (Vaseworld); Sokoban (warehouse-keeper) (Schrader, 2018); Mountain-Car (Sutton & Barto, 2011); Under-damped Pendulum (Brockman et al., 2016). (See the environment-instantiation sketch after this table.)
Dataset Splits | No | The paper describes general training procedures and batch sizes, but does not specify a separate validation dataset split with exact percentages or sample counts for hyperparameter tuning or early stopping.
Hardware Specification | Yes | All experiments were run on a workstation with 40 cores, 256 GB RAM and 2 NVIDIA GTX 1080Ti GPUs.
Software Dependencies | No | The paper mentions adapting implementations from 'Shangtong (2018)' and 'OpenAI Gym (Brockman et al., 2016)', but does not specify version numbers for these or other software libraries/frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The policy is parameterized by a 3-layer deep, 256-unit wide (fully connected) ReLU network and trained via Duelling Double Deep Q-Learning (Van Hasselt et al., 2016; Wang et al., 2015). The discount factor is set to 0.99 and the target network is updated once every 200 iterations. For exploration, we use an ϵ-greedy policy, where ϵ is decayed linearly from 1 to 0.1 in the span of the first 10000 iterations. The replay buffer stores 10000 experiences and the batch-size used is 10. (A configuration sketch collecting these hyperparameters follows this table.)
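
The paper's Algorithm 1 ("Training the h-Potential") is not reproduced in this report, but the general idea of learning a potential that increases along observed transitions can be illustrated. The following is a minimal sketch, assuming the objective maximizes the expected increase of h over sampled transitions while an L2 penalty keeps h bounded; the network shape, regularizer, and hyperparameters are illustrative assumptions, not the authors' exact Algorithm 1.

```python
# Hypothetical sketch of training an h-potential on observed transitions.
# Assumes the objective rewards increases of h along the arrow of time,
# regularized by an L2 penalty on h's magnitude; see Appendix B of the
# paper for the exact procedure. All names and values are illustrative.
import torch
import torch.nn as nn


class HPotential(nn.Module):
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)  # scalar potential per state


def train_h(h, transition_batches, epochs=10, lr=1e-3, reg_weight=1e-2):
    """transition_batches: list of (s_t, s_tp1) tensor batches from rollouts."""
    opt = torch.optim.Adam(h.parameters(), lr=lr)
    for _ in range(epochs):
        for s_t, s_tp1 in transition_batches:
            gain = (h(s_tp1) - h(s_t)).mean()                    # increase along time
            reg = (h(s_t) ** 2).mean() + (h(s_tp1) ** 2).mean()  # keep h bounded
            loss = -gain + reg_weight * reg                      # maximize gain, regularize
            opt.zero_grad()
            loss.backward()
            opt.step()
    return h
```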
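
For the publicly available environments listed under Open Datasets, the sketch below shows how they could be instantiated. The environment IDs are assumptions based on the standard OpenAI Gym and gym-sokoban registries of that era (pre-0.26 Gym API); the 7x7 Vaseworld is the authors' custom environment and has no public ID.

```python
# Illustrative instantiation of the public environments via OpenAI Gym
# (Brockman et al., 2016) and gym-sokoban (Schrader, 2018). The IDs below
# are the standard registry names and are assumptions; Vaseworld is custom
# to the paper and is not included here.
import gym
import gym_sokoban  # noqa: F401  (importing registers the Sokoban-* envs)

env_ids = ["MountainCar-v0", "Pendulum-v0", "Sokoban-v0"]

for env_id in env_ids:
    env = gym.make(env_id)
    obs = env.reset()                                        # old Gym API: returns obs only
    obs, reward, done, info = env.step(env.action_space.sample())
    print(env_id, reward, done)
    env.close()
```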
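
The Experiment Setup row quotes concrete hyperparameters; the sketch below collects them into a single configuration and pairs them with a standard dueling Q-network of the stated width and depth. The exact split of the authors' 3-layer, 256-unit network into value and advantage streams is an assumption, and all identifiers are illustrative.

```python
# A minimal sketch of a dueling Q-network and the hyperparameters quoted
# in the Experiment Setup row. How the authors arrange the 3-layer,
# 256-unit trunk and the value/advantage heads is an assumption.
import torch.nn as nn


class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)               # state-value stream
        self.advantage = nn.Linear(hidden, n_actions)   # advantage stream

    def forward(self, s):
        z = self.trunk(s)
        v, a = self.value(z), self.advantage(z)
        return v + a - a.mean(dim=-1, keepdim=True)     # dueling aggregation


# Hyperparameters as reported in the paper's experiment setup.
CONFIG = dict(
    gamma=0.99,               # discount factor
    target_update_every=200,  # iterations between target-network syncs
    eps_start=1.0,            # epsilon-greedy schedule: 1.0 -> 0.1
    eps_end=0.1,
    eps_decay_iters=10_000,   # linear decay over the first 10000 iterations
    replay_buffer_size=10_000,
    batch_size=10,
)
```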