Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Authors: Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A Ramirez, Christopher K Harris, A. Rupam Mahmood, Dale Schuurmans

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four Room task. Empirically, we demonstrate on Baird's counterexample that the over-parameterized target TD converges faster than other existing solutions to the deadly triad, such as residual minimization (RM) or gradient TD methods, while using less memory than convergent methods like LSTD.
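The mechanism summarized in the row above can be sketched in a few lines: plain off-policy linear TD(0) diverges on Baird's counterexample, but bootstrapping from a periodically frozen target network, with features rich enough to interpolate the bootstrapped targets (over-parameterization), keeps the iterates stable. The sketch below uses the standard construction of the counterexample from Sutton & Barto (Fig. 11.1) and the paper's discount factor γ = 0.95; the step size, sync period, and iteration count are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

# Baird's counterexample: 7 states, 8 features (standard construction,
# Sutton & Barto, Fig. 11.1) -- more parameters than states.
PHI = np.zeros((7, 8))
for i in range(6):
    PHI[i, i] = 2.0      # states 1-6: v(s_i) = 2*theta_i + theta_8
    PHI[i, 7] = 1.0
PHI[6, 6] = 1.0          # state 7: v(s_7) = theta_7 + 2*theta_8
PHI[6, 7] = 2.0

gamma = 0.95             # discount factor reported in the paper
alpha = 0.05             # illustrative step size (hypothetical)
sync_every = 100         # illustrative target-network refresh period

rng = np.random.default_rng(0)
theta = np.ones(8)
theta[6] = 10.0          # initialization under which plain off-policy TD diverges
theta_target = theta.copy()

for t in range(50_000):
    s = rng.integers(7)  # off-policy: states sampled uniformly
    s_next, r = 6, 0.0   # target policy always moves to state 7; rewards are 0
    # Semi-gradient TD(0), bootstrapping from the frozen target weights.
    td_err = r + gamma * PHI[s_next] @ theta_target - PHI[s] @ theta
    theta += alpha * td_err * PHI[s]
    if (t + 1) % sync_every == 0:
        theta_target = theta.copy()

# True values are all zero, so max |PHI @ theta| is the value-estimation error.
print(np.abs(PHI @ theta).max())
```

Because the 8-dimensional features can fit the bootstrapped targets of all 7 states exactly, each frozen-target phase is a plain regression problem, and each target sync contracts the values by roughly a factor of γ, whereas removing the target network (bootstrapping from `theta` itself) makes the same loop diverge.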
Researcher Affiliation Collaboration 1 Department of Computing Science, University of Alberta; 2 School of Data Science, The Chinese University of Hong Kong, Shenzhen; 3 Google DeepMind; 4 School of Computational Science and Engineering, Georgia Tech; 5 Figure; 6 The work was done while the author was at Google; 7 Uber; 8 CIFAR AI Chair, Amii.
Pseudocode No The paper describes algorithms using mathematical equations but does not provide pseudocode or clearly labeled algorithm blocks.
Open Source Code No The paper does not provide any explicit statement about releasing source code or include links to code repositories.
Open Datasets No The paper states: "On Baird counterexample, states are sampled from a uniform distribution..." and "In this section, we empirically analyze the value prediction errors in an episodic Four Room task using offline data from trajectories under a random behaviour policy." While these are well-known environments, the paper describes data generation rather than providing access information (URL, DOI, citation) to specific pre-collected datasets used in their experiments.
Dataset Splits No The paper does not explicitly provide details on training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined split references).
Hardware Specification No The paper does not explicitly describe the hardware specifications (e.g., specific GPU/CPU models, memory) used to run its experiments.
Software Dependencies No The paper does not provide specific version numbers for any software dependencies (e.g., libraries, frameworks, or programming languages) used in the experiments.
Experiment Setup Yes The discount factor is set to γ = 0.95, with hyperparameter choices listed in Appendix A.7. Table 2 shows the hyperparameters for all algorithms tuned on Baird's counterexample, and Table 3 shows those tuned on the Four Room task; all hyperparameters were found by grid search. All empirical results are averaged over 10 random seeds.
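As a rough illustration of the tuning procedure described in the row above, grid search simply enumerates every hyperparameter combination and keeps the best-scoring one. The grid values and the scoring stand-in below are hypothetical placeholders, not the contents of the paper's Tables 2 and 3.

```python
import itertools

# Hypothetical tuning grid for target TD; values are illustrative only.
grid = {
    "step_size": [0.005, 0.01, 0.05],
    "target_sync_period": [50, 100, 200],
}

def evaluate(step_size, target_sync_period):
    # Stand-in for running the algorithm (e.g., on Baird's counterexample)
    # and averaging the value-estimation error over 10 random seeds.
    return abs(step_size * target_sync_period - 1.0)

# Enumerate all combinations and keep the configuration with the lowest error.
best = min(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda h: evaluate(**h),
)
print(best)
```

In practice `evaluate` would train and score the agent per seed; the exhaustive `itertools.product` loop is the entire "grid search" the table captions refer to.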