Loss Dynamics of Temporal Difference Reinforcement Learning

Authors: Blake Bordelon, Paul Masset, Henry Kuo, Cengiz Pehlevan

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We perform TD learning with Fourier features to evaluate a pre-trained policy on Mountain Car-v0. |
| Researcher Affiliation | Academia | Blake Bordelon, Paul Masset, Henry Kuo & Cengiz Pehlevan, John Paulson School of Engineering and Applied Sciences, Center for Brain Science, Kempner Institute for the Study of Natural & Artificial Intelligence, Harvard University, Cambridge MA, 02138. blake_bordelon@g.harvard.edu, cpehlevan@g.harvard.edu |
| Pseudocode | No | The paper describes mathematical derivations and theoretical frameworks but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to generate the Figures is provided in the Supplementary Material as a Jupyter Notebook at the following GitHub repository: https://github.com/Pehlevan-Group/TD-RL-dynamics. |
| Open Datasets | No | The paper describes using the Mountain Car-v0 environment and generating its own trajectories and policies, but it does not provide concrete access information (link, DOI, citation) to a specific publicly available dataset used for training/evaluation. |
| Dataset Splits | No | The paper describes experimental setups and data generation within environments like Mountain Car-v0, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | Numerical experiments were performed on an NVIDIA SMX4-A100-80GB GPU using JAX to vectorize repetitive aspects of the experiments. |
| Software Dependencies | No | The paper mentions using JAX for numerical experiments but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | First, we train a policy with tabular ϵ-greedy Q-Learning (ϵ = 0.1, γ = 0.99, η = 0.01) to learn policy π. The position and velocity are discretized into 42 and 28 states, respectively. The learned policy π is not optimal but consistently reaches the goal within 350 timesteps. Therefore, each episode is set to have a length of 350 timesteps. |
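The Experiment Setup row quotes the policy-training stage: tabular ϵ-greedy Q-learning with ϵ = 0.1, γ = 0.99, η = 0.01, position and velocity discretized into 42 and 28 states, and episodes capped at 350 timesteps. The sketch below is one way to realize that configuration; it assumes the Gymnasium API, and the episode budget, seeding, and bin construction are illustrative choices not specified in the quoted text.

```python
# Hedged sketch: tabular epsilon-greedy Q-learning on MountainCar-v0 with the
# discretization and hyperparameters quoted in the Experiment Setup row.
# The episode budget, seeding, and Gymnasium API usage are assumptions.
import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0", max_episode_steps=350)  # 350-step episodes, as in the paper
low, high = env.observation_space.low, env.observation_space.high

N_POS, N_VEL = 42, 28                        # discretization from the paper
EPSILON, GAMMA, ETA = 0.1, 0.99, 0.01        # epsilon-greedy Q-learning hyperparameters

def discretize(obs):
    """Map the continuous (position, velocity) observation to a pair of bin indices."""
    ratios = (obs - low) / (high - low)
    pos = min(int(ratios[0] * N_POS), N_POS - 1)
    vel = min(int(ratios[1] * N_VEL), N_VEL - 1)
    return pos, vel

Q = np.zeros((N_POS, N_VEL, env.action_space.n))
rng = np.random.default_rng(0)

for episode in range(5000):                  # episode budget is an assumption
    obs, _ = env.reset(seed=episode)
    state = discretize(obs)
    done = False
    while not done:
        if rng.random() < EPSILON:           # explore with probability epsilon
            action = int(rng.integers(env.action_space.n))
        else:                                # otherwise act greedily w.r.t. Q
            action = int(np.argmax(Q[state]))
        obs, reward, terminated, truncated, _ = env.step(action)
        next_state = discretize(obs)
        target = reward + (0.0 if terminated else GAMMA * np.max(Q[next_state]))
        Q[state + (action,)] += ETA * (target - Q[state + (action,)])
        state = next_state
        done = terminated or truncated

# The greedy policy pi(s) = argmax_a Q[s, a] is the policy that is later evaluated with TD learning.
```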
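The Research Type row notes that the paper performs TD learning with Fourier features to evaluate a pre-trained policy on Mountain Car-v0. The sketch below shows linear TD(0) policy evaluation with a Fourier cosine basis under the same Gymnasium assumptions as above; the basis order, learning rate, and the uniform-random stand-in for the pre-trained policy (in the paper, the Q-learning policy from the previous stage) are assumptions, not the authors' exact configuration.

```python
# Hedged sketch: linear TD(0) policy evaluation with Fourier features on MountainCar-v0.
# The Fourier order, learning rate, and the placeholder policy are illustrative assumptions.
import itertools
import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0", max_episode_steps=350)
low, high = env.observation_space.low, env.observation_space.high

ORDER = 3                                    # Fourier basis order (assumed)
coeffs = np.array(list(itertools.product(range(ORDER + 1), repeat=2)))  # frequency vectors

def features(obs):
    """Fourier cosine features of the normalized (position, velocity) state."""
    s = (obs - low) / (high - low)           # rescale state to [0, 1]^2
    return np.cos(np.pi * coeffs @ s)

def policy(obs, rng):
    """Placeholder for the pre-trained policy (here: uniform random, an assumption)."""
    return int(rng.integers(env.action_space.n))

gamma, lr = 0.99, 0.01                       # discount from the paper; learning rate assumed
w = np.zeros(coeffs.shape[0])                # weights of the linear value estimate
rng = np.random.default_rng(0)

for episode in range(200):                   # episode budget is an assumption
    obs, _ = env.reset(seed=episode)
    for t in range(350):                     # episode length used in the paper
        action = policy(obs, rng)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        phi, phi_next = features(obs), features(next_obs)
        target = reward + (0.0 if terminated else gamma * w @ phi_next)
        w += lr * (target - w @ phi) * phi   # TD(0) update on the linear value estimate
        obs = next_obs
        if terminated or truncated:
            break
```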