Reducing Sampling Error in Batch Temporal Difference Learning
Authors: Brahma Pavse, Ishan Durugkar, Josiah Hanna, Peter Stone
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0). |
| Researcher Affiliation | Collaboration | 1The University of Texas at Austin 2School of Informatics, University of Edinburgh 3To be joining the Computer Sciences Department, University of Wisconsin Madison 4Sony AI. |
| Pseudocode | Yes | Algorithm 1: Batch Linear TD(0) to estimate $v_{\pi_e}$ (see the sketch after this table). |
| Open Source Code | No | The paper does not contain any explicit statements or links about providing open-source code for the described methodology. |
| Open Datasets | No | The paper mentions standard reinforcement learning domains like Gridworld, Cart Pole, and Inverted Pendulum. However, it does not provide concrete access information (e.g., specific links, DOIs, or citations with author/year) for publicly available datasets used in the experiments. For Cart Pole and Inverted Pendulum, it describes generating data via 'Monte Carlo rollouts'. |
| Dataset Splits | Yes | In all PSEC training settings, PSEC performs gradient steps using the full batch of data, uses a separate batch of data as the validation data, and terminates training according to early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions general terms like 'trained our models'. |
| Software Dependencies | No | The paper mentions 'OpenAI Gym', 'MuJoCo', 'Adam', and 'PPO' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In all experiments, the value function learning algorithm iterates over the whole batch of data until convergence, after which the MSVE of the final value function is computed. Some experiments include a parameter sweep over the hyperparameters, which can be found in Appendix G. ... The results shown here are with sweeps over only the value function model class and PSEC learning rate. |
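
Since the paper releases no code, the following is a minimal Python sketch of batch linear TD(0) with an optional PSEC-style correction, based only on the paper's high-level description: PSEC-TD(0) reweights each transition by the ratio of the evaluation policy's action probability to a maximum-likelihood estimate of the action distribution in the batch. All names here (`batch_linear_td0`, `phi`, `pi_hat`, `msve`) are hypothetical, and the fixed number of sweeps stands in for the paper's iterate-until-convergence criterion.

```python
import numpy as np

def batch_linear_td0(batch, phi, n_features, pi_e, pi_hat=None,
                     gamma=0.99, alpha=0.05, n_sweeps=200):
    """Batch linear TD(0) with an optional PSEC-style weight.

    batch:  list of (s, a, r, s_next, done) transitions.
    phi:    feature map, state -> np.ndarray of length n_features.
    pi_e:   pi_e(a, s), evaluation-policy probability of a in s.
    pi_hat: if given, pi_hat(a, s) is an MLE estimate of the
            empirical action distribution in the batch; each update
            is then weighted by pi_e(a, s) / pi_hat(a, s).
    """
    w = np.zeros(n_features)
    for _ in range(n_sweeps):                    # fixed sweeps stand in for
        for s, a, r, s_next, done in batch:      # "iterate until convergence"
            x, x_next = phi(s), phi(s_next)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x   # TD(0) error
            rho = 1.0 if pi_hat is None else pi_e(a, s) / pi_hat(a, s)
            w += alpha * rho * delta * x         # (PSEC-)weighted update
    return w

def msve(w, phi, states, v_true):
    """Mean squared value error of the learned weights on `states`."""
    v_hat = np.array([w @ phi(s) for s in states])
    return float(np.mean((v_hat - v_true) ** 2))
```

For a tabular domain such as the paper's Gridworld, `phi` can return a one-hot vector per state, in which case `w` is the state-value table itself and `msve` matches the paper's final-value-function evaluation step.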