Optimal Estimation of Policy Gradient via Double Fitted Iteration
Authors: Chengzhuo Ni, Ruiqi Zhang, Xiang Ji, Xuezhou Zhang, Mengdi Wang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, USA; School of Mathematical Science, Peking University, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 Fitted PG Algorithm |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper mentions using the 'OpenAI gym Frozen Lake and Cliff Walking environment' to generate datasets, but does not provide specific access information (URL, DOI, repository, or formal citation for a pre-existing public dataset) for the datasets generated or used. |
| Dataset Splits | No | The paper discusses the use of 'off-policy data' and 'offline logged data' but does not specify clear train/validation/test dataset splits, percentages, or methodology for partitioning data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions software components like 'OpenAI gym' and 'softmax tabular or ReLU policy networks' but does not specify version numbers for any libraries, frameworks, or other software dependencies. |
| Experiment Setup | No | The paper describes some aspects of the experimental setup, such as policy parameterization and environment modifications (e.g., 'adding artificial randomness for stochastic transitions... with probability 0.1'), but does not provide specific hyperparameters like learning rate, batch size, number of epochs, or optimizer settings. |
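
Because no code is released and the setup rows above only paraphrase the paper, reproducing the data-collection step requires filling in unstated details. The sketch below, assuming the Gymnasium (successor to OpenAI Gym) API, a uniform-random behavior policy, and a noise mechanism that resamples the executed action, shows one way to generate offline logged data from FrozenLake with the 0.1 artificial-randomness probability quoted in the Experiment Setup row; none of these specific choices are confirmed by the paper.

```python
import random

import gymnasium as gym  # assumption: Gymnasium API; the paper only says "OpenAI gym"


def collect_offline_dataset(num_episodes=500, noise_prob=0.1, seed=0):
    """Log (state, action, reward, next_state, done) transitions from FrozenLake
    under a uniform-random behavior policy (an assumption; the paper does not
    specify the behavior policy or the logging format)."""
    env = gym.make("FrozenLake-v1", is_slippery=False)
    rng = random.Random(seed)
    dataset = []
    for episode in range(num_episodes):
        state, _ = env.reset(seed=seed + episode)
        done = False
        while not done:
            # Behavior policy: uniform random over the action space (assumed).
            action = env.action_space.sample()
            # Mimic "adding artificial randomness for stochastic transitions
            # ... with probability 0.1" by resampling the executed action;
            # the paper does not spell out the exact mechanism.
            if rng.random() < noise_prob:
                action = env.action_space.sample()
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            dataset.append((state, action, reward, next_state, done))
            state = next_state
    env.close()
    return dataset


if __name__ == "__main__":
    data = collect_offline_dataset()
    print(f"Collected {len(data)} transitions")
```

A dataset logged this way would then feed the paper's Fitted PG procedure (Algorithm 1); since the paper omits hyperparameters such as learning rate and batch size, any optimization settings layered on top of this sketch are likewise guesses.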