Reducing Sampling Error in Batch Temporal Difference Learning

Authors: Brahma Pavse, Ishan Durugkar, Josiah Hanna, Peter Stone

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0)."
Researcher Affiliation | Collaboration | (1) The University of Texas at Austin; (2) School of Informatics, University of Edinburgh; (3) To be joining the Computer Sciences Department, University of Wisconsin-Madison; (4) Sony AI.
Pseudocode | Yes | "Algorithm 1 Batch Linear TD(0) to estimate v_e" (a minimal sketch of this procedure follows the table).
Open Source Code | No | The paper does not contain any explicit statements or links about providing open-source code for the described methodology.
Open Datasets | No | The paper mentions standard reinforcement learning domains such as Gridworld, Cart Pole, and Inverted Pendulum, but it does not provide concrete access information (e.g., specific links, DOIs, or citations with author/year) for publicly available datasets used in the experiments. For Cart Pole and Inverted Pendulum, it describes generating data via "Monte Carlo rollouts".
Dataset Splits | Yes | "In all PSEC training settings, PSEC performs gradient steps using the full batch of data, uses a separate batch of data as the validation data, and terminates training according to early stopping." (See the early-stopping sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments; it only mentions general terms like "trained our models".
Software Dependencies | No | The paper mentions OpenAI Gym, MuJoCo, Adam, and PPO but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "In all experiments, the value function learning algorithm iterates over the whole batch of data until convergence, after which the MSVE of the final value function is computed. Some experiments include a parameter sweep over the hyperparameters, which can be found in Appendix G. ... The results shown here are with sweeps over only the value function model class and PSEC learning rate." (See the MSVE sketch after the table.)
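The Pseudocode row refers to batch linear TD(0), which the quoted Experiment Setup row says is run over the whole batch until convergence. Below is a minimal Python sketch of that procedure; the function name, transition format, and hyperparameter defaults are assumptions for illustration, and the paper's PSEC correction (reweighting updates to reduce sampling error) is deliberately not shown.

```python
import numpy as np

def batch_linear_td0(batch, dim, gamma=1.0, alpha=0.05, tol=1e-8, max_sweeps=10_000):
    """Batch linear TD(0) sketch: sweep a fixed batch of transitions,
    updating weights w (so v(s) ~= w @ phi(s)) until the per-sweep
    weight change falls below tol. Each transition is a tuple
    (phi_s, reward, phi_s_next, done); all names here are illustrative.
    """
    w = np.zeros(dim)
    for _ in range(max_sweeps):
        w_prev = w.copy()
        for phi_s, r, phi_next, done in batch:
            v_next = 0.0 if done else w @ phi_next   # bootstrap unless terminal
            td_error = r + gamma * v_next - w @ phi_s
            w = w + alpha * td_error * phi_s          # semi-gradient TD(0) update
        if np.linalg.norm(w - w_prev) < tol:          # converged on this batch
            return w
    return w
```

With one-hot state features (a natural choice for a small Gridworld, one of the paper's domains), this linear update reduces to tabular batch TD(0).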
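The Dataset Splits row describes PSEC training as full-batch gradient steps with a separate validation batch and early stopping. Here is a minimal sketch of that loop, assuming a hypothetical model interface (gradient_step, get_params, set_params; none of these names come from the paper):

```python
def train_with_early_stopping(model, train_batch, val_loss_fn,
                              patience=20, max_steps=5000):
    """Full-batch training with validation-based early stopping (sketch).
    `model` is a hypothetical object exposing gradient_step/get_params/
    set_params; `val_loss_fn` scores the model on the held-out batch.
    """
    best_val = float("inf")
    best_params = model.get_params()
    stale = 0
    for _ in range(max_steps):
        model.gradient_step(train_batch)   # one gradient step on the full batch
        val = val_loss_fn(model)           # loss on the separate validation batch
        if val < best_val:
            best_val, best_params, stale = val, model.get_params(), 0
        else:
            stale += 1
            if stale >= patience:          # no improvement for `patience` steps
                break
    model.set_params(best_params)          # restore the best checkpoint
    return model
```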
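The Experiment Setup row reports the MSVE (mean squared value error) of the final value function. A common definition is the squared error between estimated and true values, weighted by a distribution over states; the paper's exact weighting is not quoted here, so treat this as an assumed form:

```python
import numpy as np

def msve(w, phi_matrix, true_values, state_weights):
    """Assumed MSVE: state-distribution-weighted squared error between the
    linear estimate phi(s) @ w and the true values. `phi_matrix` stacks one
    feature vector per state; `state_weights` sums to 1."""
    errors = phi_matrix @ w - true_values
    return float(np.average(errors ** 2, weights=state_weights))
```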