Optimal Estimation of Policy Gradient via Double Fitted Iteration

Authors: Chengzhuo Ni, Ruiqi Zhang, Xiang Ji, Xuezhou Zhang, Mengdi Wang

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance-reduction techniques. (A hedged sketch of these policy parameterizations appears after this table.)
Researcher Affiliation | Academia | 1) Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, USA; 2) School of Mathematical Science, Peking University, Beijing, China.
Pseudocode | Yes | Algorithm 1: Fitted PG Algorithm.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper mentions using the OpenAI Gym FrozenLake and CliffWalking environments to generate datasets, but does not provide specific access information (URL, DOI, repository, or formal citation) for the datasets generated or used. (A hedged data-collection sketch using these environments appears after this table.)
Dataset Splits | No | The paper discusses the use of 'off-policy data' and 'offline logged data' but does not specify train/validation/test splits, percentages, or a methodology for partitioning the data.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or cloud computing instance types.
Software Dependencies | No | The paper mentions software components such as OpenAI Gym and softmax tabular or ReLU policy networks, but does not specify version numbers for any libraries, frameworks, or other software dependencies.
Experiment Setup | No | The paper describes some aspects of the experimental setup, such as policy parameterization and environment modifications (e.g., 'adding artificial randomness for stochastic transitions... with probability 0.1'), but does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings. (A hedged sketch of such an environment modification appears after this table.)
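
The paper states only that experiments use "softmax tabular or ReLU policy networks"; it does not describe the implementation framework or layer sizes. Below is a minimal sketch of what such parameterizations typically look like, assuming PyTorch and an arbitrary hidden width; none of the class names or sizes come from the paper.

```python
import torch
import torch.nn as nn

class SoftmaxTabularPolicy(nn.Module):
    """One logit per (state, action) pair; pi(a|s) is the softmax over the row of state s."""
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_states, n_actions))

    def forward(self, state):  # state: integer index
        return torch.softmax(self.logits[state], dim=-1)

class ReLUPolicy(nn.Module):
    """A small ReLU network mapping a state feature vector to action probabilities."""
    def __init__(self, state_dim, n_actions, hidden=64):  # hidden width is a placeholder guess
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):  # state: float tensor of shape (state_dim,)
        return torch.softmax(self.net(state), dim=-1)
```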
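The "Open Datasets" row notes that the offline data are generated from the OpenAI Gym FrozenLake and CliffWalking environments rather than downloaded from a public repository. The sketch below illustrates one way such off-policy logged data could be collected; the uniform behavior policy, episode count, horizon, and environment IDs are assumptions, not details taken from the paper.

```python
import numpy as np
import gym  # OpenAI Gym; environment IDs may differ across Gym versions

def collect_offline_data(env_name="FrozenLake-v1", n_episodes=100, horizon=100, seed=0):
    """Log (state, action, reward, next_state) transitions from a uniform behavior policy."""
    rng = np.random.default_rng(seed)
    env = gym.make(env_name)
    n_actions = env.action_space.n
    dataset = []
    for _ in range(n_episodes):
        out = env.reset()
        state = out[0] if isinstance(out, tuple) else out  # newer Gym returns (obs, info)
        for _ in range(horizon):
            action = int(rng.integers(n_actions))  # uniform behavior policy
            out = env.step(action)
            next_state, reward, done = out[0], out[1], out[2]
            dataset.append((state, action, reward, next_state))
            state = next_state
            if done:
                break
    return dataset

offline_data = collect_offline_data("CliffWalking-v0", n_episodes=50)
```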
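The "Experiment Setup" row quotes the paper's environment modification of "adding artificial randomness for stochastic transitions... with probability 0.1." The paper does not say how this randomness is injected; one plausible mechanism is an action wrapper that replaces the chosen action with a random one with that probability, sketched below. Only the 0.1 probability comes from the paper; the wrapper itself is an assumption.

```python
import numpy as np
import gym

class RandomTransitionWrapper(gym.ActionWrapper):
    """With probability `noise_prob`, substitute a uniformly random action,
    turning an otherwise deterministic environment into a stochastic one."""
    def __init__(self, env, noise_prob=0.1, seed=0):
        super().__init__(env)
        self.noise_prob = noise_prob
        self.rng = np.random.default_rng(seed)

    def action(self, action):
        if self.rng.random() < self.noise_prob:
            return self.env.action_space.sample()
        return action

stochastic_env = RandomTransitionWrapper(gym.make("CliffWalking-v0"), noise_prob=0.1)
```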