Understanding and Leveraging Overparameterization in Recursive Value Estimation
Authors: Chenjun Xiao, Bo Dai, Jincheng Mei, Oscar A Ramirez, Ramki Gummadi, Chris Harris, Dale Schuurmans
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically we find that these regularizers dramatically improve the stability of TD and FVI, while allowing RM to match and even sometimes surpass their generalization performance with assured stability. |
| Researcher Affiliation | Collaboration | ¹Google, ²Department of Computing Science, University of Alberta |
| Pseudocode | No | The paper provides mathematical update equations for algorithms like RM, TD, and FVI (e.g., Eq. 5, 6, 9), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. (A hedged sketch of such update rules is given after this table.) |
| Open Source Code | No | The paper does not include any explicit statement about releasing source code, provide a link to a code repository, or mention code availability in supplementary materials for the described methodology. |
| Open Datasets | Yes | We consider both discrete and continuous control benchmarks in this analysis. For the discrete action environments, we use DQN (Mnih et al., 2015) as the baseline algorithm to add our regularizers. For continuous control environments, we use QT-Opt (Kalashnikov et al., 2018) as the baseline algorithm... We provide extra experiment results on four Mujoco control problems... Half Cheetah, Hopper, Ant, and Walker2d. |
| Dataset Splits | No | The paper mentions using a 'fixed offline data set' and 'replay buffer with 10k tuples' but does not specify explicit training/validation/test dataset splits, percentages, or sample counts for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using DQN and QT-Opt as baseline algorithms, but it does not list any specific software dependencies (e.g., programming languages, libraries, or solvers) with their version numbers required to replicate the experiment. |
| Experiment Setup | Yes | Appendix B.1 Acrobot: Replay buffer with 10k tuples sampled using a random policy across trajectories with maximum episode length of 64. A DQN with hidden units consisting of fully connected layers with (100, 100) units. Batch size 64. Learning rate of 1e-3. Regularized RM with weight of 2e-2 on Rφ and 1e-4 on Rw. Regularized TD with weight of 0 on Rφ and 1e-4 on Rw. (These values are collected into a configuration sketch after the table.) |
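Since the paper expresses RM, TD, and FVI only as update equations (Eq. 5, 6, 9) without pseudocode, the sketch below illustrates what regularized TD and RM objectives for a Q-network could look like. This is not the authors' implementation: the L2 forms assumed here for Rφ (penalty on the penultimate-layer features) and Rw (penalty on the output-layer weights), the PyTorch framework, and all function and variable names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP Q-network; hidden sizes follow the paper's Acrobot setup of (100, 100) units."""
    def __init__(self, obs_dim, n_actions, hidden=(100, 100)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(obs_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
        )
        # Final linear layer whose weights the assumed Rw penalty acts on.
        self.head = nn.Linear(hidden[1], n_actions)

    def forward(self, obs):
        phi = self.features(obs)          # penultimate features phi(s)
        return self.head(phi), phi


def regularized_losses(q_net, target_net, batch, gamma=0.99,
                       rphi_weight=2e-2, rw_weight=1e-4):
    """Sketch of regularized TD and RM objectives (assumed L2 forms for Rphi and Rw)."""
    obs, act, rew, next_obs, done = batch
    q_all, phi = q_net(obs)
    q_sa = q_all.gather(1, act.unsqueeze(1)).squeeze(1)

    # TD bootstraps from a frozen (stop-gradient) target network copy.
    next_q_frozen, _ = target_net(next_obs)
    td_target = rew + gamma * (1.0 - done) * next_q_frozen.max(dim=1).values.detach()

    # RM (residual minimization) differentiates through the next-state value.
    next_q_live, _ = q_net(next_obs)
    rm_target = rew + gamma * (1.0 - done) * next_q_live.max(dim=1).values

    # Assumed regularizers: Rphi = E[||phi(s)||^2], Rw = ||w||^2 on the output layer.
    r_phi = phi.pow(2).sum(dim=1).mean()
    r_w = q_net.head.weight.pow(2).sum()

    td_loss = (q_sa - td_target).pow(2).mean() + rphi_weight * r_phi + rw_weight * r_w
    rm_loss = (q_sa - rm_target).pow(2).mean() + rphi_weight * r_phi + rw_weight * r_w
    return td_loss, rm_loss
```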
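Similarly, the Appendix B.1 Acrobot settings quoted in the table can be gathered into a single configuration sketch. The values are taken directly from the quoted text; the dictionary keys are hypothetical names chosen for readability, not the authors' configuration schema.

```python
# Acrobot setup quoted from Appendix B.1; key names are illustrative, values are from the paper.
ACROBOT_CONFIG = {
    "replay_buffer_size": 10_000,     # tuples sampled with a random policy
    "max_episode_length": 64,
    "hidden_units": (100, 100),       # fully connected DQN layers
    "batch_size": 64,
    "learning_rate": 1e-3,
    "regularizer_weights": {
        "RM": {"R_phi": 2e-2, "R_w": 1e-4},
        "TD": {"R_phi": 0.0,  "R_w": 1e-4},
    },
}
```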