Understanding and Leveraging Overparameterization in Recursive Value Estimation

Authors: Chenjun Xiao, Bo Dai, Jincheng Mei, Oscar A Ramirez, Ramki Gummadi, Chris Harris, Dale Schuurmans

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically we find that these regularizers dramatically improve the stability of TD and FVI, while allowing RM to match and even sometimes surpass their generalization performance with assured stability."
Researcher Affiliation | Collaboration | Google; Department of Computing Science, University of Alberta
Pseudocode | No | The paper provides mathematical update equations for algorithms such as RM, TD, and FVI (e.g., Eqs. 5, 6, and 9), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. (A hedged sketch of the TD and RM objectives is given below the table.)
Open Source Code | No | The paper does not include any explicit statement about releasing source code, provide a link to a code repository, or mention code availability in supplementary materials for the described methodology.
Open Datasets | Yes | "We consider both discrete and continuous control benchmarks in this analysis. For the discrete action environments, we use DQN (Mnih et al., 2015) as the baseline algorithm to add our regularizers. For continuous control environments, we use QT-Opt (Kalashnikov et al., 2018) as the baseline algorithm... We provide extra experiment results on four Mujoco control problems... Half Cheetah, Hopper, Ant, and Walker2d."
Dataset Splits | No | The paper mentions using a 'fixed offline data set' and a 'replay buffer with 10k tuples', but it does not specify explicit training/validation/test splits, percentages, or sample counts for reproducibility. (See the dataset-collection sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions using DQN and QT-Opt as baseline algorithms, but it does not list specific software dependencies (e.g., programming languages, libraries, or solvers) with version numbers required to replicate the experiments.
Experiment Setup | Yes | Appendix B.1, Acrobot: replay buffer with 10k tuples sampled using a random policy across trajectories with a maximum episode length of 64; a DQN whose hidden layers are fully connected with (100, 100) units; batch size 64; learning rate 1e-3; regularized RM with weight 2e-2 on Rφ and 1e-4 on Rw; regularized TD with weight 0 on Rφ and 1e-4 on Rw. (These settings are collected into a config sketch below the table.)
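As noted in the Pseudocode row, the paper states its methods only as update equations (Eqs. 5, 6, 9). Below is a minimal PyTorch sketch of the standard objectives those acronyms refer to: the semi-gradient temporal-difference (TD) loss, which treats the bootstrapped target as a constant, versus the residual-minimization (RM) loss, which passes gradients through both sides of the Bellman residual. The (100, 100) hidden sizes and the 2e-2 / 1e-4 regularizer weights come from the Appendix B.1 quote above; the exact forms of Rφ and Rw are not reproduced here and are stood in by generic L2 penalties on the penultimate features and the last-layer weights. All function and variable names, and the discount factor, are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Small fully connected Q-network; (100, 100) hidden units as in Appendix B.1."""
    def __init__(self, obs_dim, n_actions, hidden=(100, 100)):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
        )
        self.head = nn.Linear(hidden[1], n_actions)

    def forward(self, obs):
        phi = self.trunk(obs)            # penultimate-layer features
        return self.head(phi), phi

def td_and_rm_losses(q_net, batch, gamma=0.99, c_phi=2e-2, c_w=1e-4):
    # gamma is a placeholder; the quoted setup does not state a discount factor.
    obs, act, rew, next_obs, done = batch
    q_all, phi = q_net(obs)
    q = q_all.gather(1, act.unsqueeze(1)).squeeze(1)
    next_q, _ = q_net(next_obs)
    target = rew + gamma * (1.0 - done) * next_q.max(dim=1).values

    td_loss = F.mse_loss(q, target.detach())  # semi-gradient TD: no gradient through the target
    rm_loss = F.mse_loss(q, target)           # RM: gradient through both sides of the residual

    # Placeholder regularizers (assumed forms); weights 2e-2 / 1e-4 follow Appendix B.1.
    r_phi = c_phi * phi.pow(2).sum(dim=1).mean()   # penalty on penultimate features
    r_w = c_w * q_net.head.weight.pow(2).sum()     # penalty on last-layer weights
    return td_loss + r_phi + r_w, rm_loss + r_phi + r_w
```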
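The Dataset Splits row quotes a 'fixed offline data set' built from a replay buffer of 10k tuples gathered by a random policy with episodes capped at 64 steps. A minimal sketch of that collection step follows, assuming Gym's Acrobot-v1 and the classic Gym interface (reset() returning an observation, step() returning a 4-tuple); neither assumption is confirmed by the paper.

```python
def collect_offline_dataset(env, n_tuples=10_000, max_episode_length=64):
    """Roll out a uniformly random behavior policy until n_tuples transitions are stored."""
    dataset = []
    while len(dataset) < n_tuples:
        obs = env.reset()
        for _ in range(max_episode_length):
            action = env.action_space.sample()          # random behavior policy
            next_obs, reward, done, _ = env.step(action)
            dataset.append((obs, action, reward, next_obs, float(done)))
            obs = next_obs
            if done or len(dataset) >= n_tuples:
                break
    return dataset

# Example usage (requires the classic `gym` package and API):
# import gym
# dataset = collect_offline_dataset(gym.make("Acrobot-v1"))
```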
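For convenience, the Appendix B.1 Acrobot settings quoted in the Experiment Setup row are collected below into a plain config dict. The key names are illustrative, and anything not stated in the quote (optimizer, discount factor, target-network details) is deliberately omitted rather than guessed.

```python
# Acrobot settings from Appendix B.1 as quoted above; key names are illustrative.
ACROBOT_SETUP = {
    "replay_buffer_tuples": 10_000,        # sampled with a random policy
    "max_episode_length": 64,
    "hidden_units": (100, 100),            # fully connected DQN layers
    "batch_size": 64,
    "learning_rate": 1e-3,
    "regularizer_weights": {
        "regularized_RM": {"R_phi": 2e-2, "R_w": 1e-4},
        "regularized_TD": {"R_phi": 0.0, "R_w": 1e-4},
    },
}
```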