Loss Dynamics of Temporal Difference Reinforcement Learning

Authors: Blake Bordelon, Paul Masset, Henry Kuo, Cengiz Pehlevan

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages and we validate our assumptions on small scale Markov Decision Processes. We perform TD learning with Fourier features to evaluate a pre-trained policy on Mountain Car-v0. |
| Researcher Affiliation | Academia | Blake Bordelon, Paul Masset, Henry Kuo & Cengiz Pehlevan, John Paulson School of Engineering and Applied Sciences, Center for Brain Science, Kempner Institute for the Study of Natural & Artificial Intelligence, Harvard University, Cambridge MA, 02138. blake_bordelon@g.harvard.edu, cpehlevan@g.harvard.edu |
| Pseudocode | No | The paper describes mathematical derivations and theoretical frameworks but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to generate the Figures is provided in the Supplementary Material as a Jupyter Notebook at the following GitHub repository: https://github.com/Pehlevan-Group/TD-RL-dynamics. |
| Open Datasets | No | The paper describes using the Mountain Car-v0 environment and generating its own trajectories and policies, but it does not provide concrete access information (link, DOI, citation) to a specific publicly available dataset used for training/evaluation. |
| Dataset Splits | No | The paper describes experimental setups and data generation within environments like Mountain Car-v0, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | Numerical experiments were performed on an NVIDIA SMX4-A100-80GB GPU using JAX to vectorize repetitive aspects of the experiments. |
| Software Dependencies | No | The paper mentions using JAX for numerical experiments but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | First, we train a policy with tabular ϵ-greedy Q-Learning (ϵ = 0.1, γ = 0.99, η = 0.01) to learn policy π. The position and velocity are discretized into 42 and 28 states, respectively. The learned policy π is not optimal but consistently reaches the goal within 350 timesteps. Therefore, each episode is set to have a length of 350 timesteps. |
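The Experiment Setup row quotes the policy-training stage: tabular ϵ-greedy Q-learning with ϵ = 0.1, γ = 0.99, η = 0.01, position and velocity discretized into 42 and 28 states, and episodes capped at 350 timesteps. The sketch below is one way to realize that configuration; it assumes the Gymnasium API, and the episode budget, seeding, and bin construction are illustrative choices not specified in the quoted text.

```python
# Hedged sketch: tabular epsilon-greedy Q-learning on MountainCar-v0 with the
# discretization and hyperparameters quoted in the Experiment Setup row.
# The episode budget, seeding, and Gymnasium API usage are assumptions.
import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0", max_episode_steps=350)  # 350-step episodes, as in the paper
low, high = env.observation_space.low, env.observation_space.high

N_POS, N_VEL = 42, 28                        # discretization from the paper
EPSILON, GAMMA, ETA = 0.1, 0.99, 0.01        # epsilon-greedy Q-learning hyperparameters

def discretize(obs):
    """Map the continuous (position, velocity) observation to a pair of bin indices."""
    ratios = (obs - low) / (high - low)
    pos = min(int(ratios[0] * N_POS), N_POS - 1)
    vel = min(int(ratios[1] * N_VEL), N_VEL - 1)
    return pos, vel

Q = np.zeros((N_POS, N_VEL, env.action_space.n))
rng = np.random.default_rng(0)

for episode in range(5000):                  # episode budget is an assumption
    obs, _ = env.reset(seed=episode)
    state = discretize(obs)
    done = False
    while not done:
        if rng.random() < EPSILON:           # explore with probability epsilon
            action = int(rng.integers(env.action_space.n))
        else:                                # otherwise act greedily w.r.t. Q
            action = int(np.argmax(Q[state]))
        obs, reward, terminated, truncated, _ = env.step(action)
        next_state = discretize(obs)
        target = reward + (0.0 if terminated else GAMMA * np.max(Q[next_state]))
        Q[state + (action,)] += ETA * (target - Q[state + (action,)])
        state = next_state
        done = terminated or truncated

# The greedy policy pi(s) = argmax_a Q[s, a] is the policy that is later evaluated with TD learning.
```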
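The Research Type row notes that the paper performs TD learning with Fourier features to evaluate a pre-trained policy on Mountain Car-v0. The sketch below shows linear TD(0) policy evaluation with a Fourier cosine basis under the same Gymnasium assumptions as above; the basis order, learning rate, and the uniform-random stand-in for the pre-trained policy (in the paper, the Q-learning policy from the previous stage) are assumptions, not the authors' exact configuration.

```python
# Hedged sketch: linear TD(0) policy evaluation with Fourier features on MountainCar-v0.
# The Fourier order, learning rate, and the placeholder policy are illustrative assumptions.
import itertools
import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0", max_episode_steps=350)
low, high = env.observation_space.low, env.observation_space.high

ORDER = 3                                    # Fourier basis order (assumed)
coeffs = np.array(list(itertools.product(range(ORDER + 1), repeat=2)))  # frequency vectors

def features(obs):
    """Fourier cosine features of the normalized (position, velocity) state."""
    s = (obs - low) / (high - low)           # rescale state to [0, 1]^2
    return np.cos(np.pi * coeffs @ s)

def policy(obs, rng):
    """Placeholder for the pre-trained policy (here: uniform random, an assumption)."""
    return int(rng.integers(env.action_space.n))

gamma, lr = 0.99, 0.01                       # discount from the paper; learning rate assumed
w = np.zeros(coeffs.shape[0])                # weights of the linear value estimate
rng = np.random.default_rng(0)

for episode in range(200):                   # episode budget is an assumption
    obs, _ = env.reset(seed=episode)
    for t in range(350):                     # episode length used in the paper
        action = policy(obs, rng)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        phi, phi_next = features(obs), features(next_obs)
        target = reward + (0.0 if terminated else gamma * w @ phi_next)
        w += lr * (target - w @ phi) * phi   # TD(0) update on the linear value estimate
        obs = next_obs
        if terminated or truncated:
            break
```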