Loss Dynamics of Temporal Difference Reinforcement Learning
Authors: Blake Bordelon, Paul Masset, Henry Kuo, Cengiz Pehlevan
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages, and we validate our assumptions on small-scale Markov Decision Processes. We perform TD learning with Fourier features to evaluate a pre-trained policy on Mountain Car-v0. |
| Researcher Affiliation | Academia | Blake Bordelon, Paul Masset, Henry Kuo & Cengiz Pehlevan; John A. Paulson School of Engineering and Applied Sciences, Center for Brain Science, and Kempner Institute for the Study of Natural & Artificial Intelligence, Harvard University, Cambridge, MA 02138; blake_bordelon@g.harvard.edu, cpehlevan@g.harvard.edu |
| Pseudocode | No | The paper describes mathematical derivations and theoretical frameworks but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to generate the figures is provided in the Supplementary Material as a Jupyter notebook and at the GitHub repository https://github.com/Pehlevan-Group/TD-RL-dynamics. |
| Open Datasets | No | The paper describes using the Mountain Car-v0 environment and generating its own trajectories and policies, but it does not provide concrete access information (link, DOI, citation) to a specific publicly available dataset used for training/evaluation. |
| Dataset Splits | No | The paper describes experimental setups and data generation within environments like Mountain Car-v0, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | Numerical experiments were performed on an NVIDIA A100-SXM4-80GB GPU using JAX to vectorize repetitive aspects of the experiments. |
| Software Dependencies | No | The paper mentions using JAX for numerical experiments but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | First, we train a policy π with tabular ϵ-greedy Q-learning (ϵ = 0.1, γ = 0.99, η = 0.01). The position and velocity are discretized into 42 and 28 states, respectively. The learned policy π is not optimal but consistently reaches the goal within 350 timesteps; each episode is therefore set to a length of 350 timesteps (see the sketch after the table). |
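
The Python sketch below is a rough illustration of the two-stage setup described in the table, not the authors' released notebook: it first trains the tabular ϵ-greedy Q-learning policy with the reported hyperparameters (ϵ = 0.1, γ = 0.99, η = 0.01, a 42 × 28 position/velocity grid, 350-step episodes), then evaluates the resulting policy π with TD(0) on random Fourier features. The environment ID ("MountainCar-v0" via gymnasium), the number of training episodes, the feature dimension and bandwidth, and the TD learning rate are assumptions not specified in this section.

```python
# Minimal sketch of the reported setup (not the authors' code): a tabular
# epsilon-greedy Q-learning policy is trained, then evaluated with TD(0)
# on random Fourier features. Values marked "assumption" are not from the paper.
import numpy as np
import gymnasium as gym

rng = np.random.default_rng(0)
env = gym.make("MountainCar-v0")           # environment ID is an assumption
low, high = env.observation_space.low, env.observation_space.high

# --- Stage 1: tabular eps-greedy Q-learning (eps=0.1, gamma=0.99, eta=0.01) ---
n_pos, n_vel = 42, 28                      # discretization reported in the paper
eps, gamma, eta = 0.1, 0.99, 0.01
Q = np.zeros((n_pos, n_vel, env.action_space.n))

def discretize(obs):
    pos = int(np.clip((obs[0] - low[0]) / (high[0] - low[0]) * n_pos, 0, n_pos - 1))
    vel = int(np.clip((obs[1] - low[1]) / (high[1] - low[1]) * n_vel, 0, n_vel - 1))
    return pos, vel

for episode in range(5000):                # number of training episodes: assumption
    obs, _ = env.reset()
    s = discretize(obs)
    for t in range(350):                   # 350-step episodes, as reported
        a = env.action_space.sample() if rng.random() < eps else int(np.argmax(Q[s]))
        obs, r, terminated, truncated, _ = env.step(a)
        s_next = discretize(obs)
        target = r + gamma * np.max(Q[s_next]) * (not terminated)
        Q[s][a] += eta * (target - Q[s][a])
        s = s_next
        if terminated or truncated:
            break

policy = lambda s: int(np.argmax(Q[s]))    # the learned (near-optimal) policy pi

# --- Stage 2: TD(0) policy evaluation with random Fourier features ---
N = 128                                    # feature dimension: assumption
W = rng.normal(size=(N, 2)) / (high - low) # random frequencies; bandwidth: assumption
b = rng.uniform(0, 2 * np.pi, size=N)
phi = lambda obs: np.sqrt(2.0 / N) * np.cos(W @ obs + b)

w = np.zeros(N)                            # value-function weights, V(s) = w . phi(s)
lr_td = 0.01                               # TD learning rate: assumption
for episode in range(200):
    obs, _ = env.reset()
    for t in range(350):
        a = policy(discretize(obs))
        obs_next, r, terminated, truncated, _ = env.step(a)
        # Semi-gradient TD(0) update on the linear value estimate
        delta = r + gamma * (0.0 if terminated else w @ phi(obs_next)) - w @ phi(obs)
        w += lr_td * delta * phi(obs)
        obs = obs_next
        if terminated or truncated:
            break
```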