reproducibilityindex.ai

Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

Authors: Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang

ICML 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct empirical experiments to answer the following two questions: (a) is the overall convergence of DR-Off-PAC doubly robust to function approximation errors, as Theorems 2 & 3 indicate? (b) how does DR-Off-PAC compare with other off-policy methods?
Researcher Affiliation	Academia	1 Department of Electrical and Computer Engineering, The Ohio State University; 2 Departments of Industrial Engineering & Management Sciences, Northwestern University; 3 Department of Operations Research and Financial Engineering, Princeton University.
Pseudocode	Yes	Algorithm 1 DR-Off-PAC. Initialize: policy parameter w_0 and estimator parameters θ_{q,0}, θ_{ρ,0}, θ_{dq,0}, and θ_{ψ,0}. for t = 0, ..., T-1 do: Obtain mini-batch samples B_t from D_d and B_{t,0} from µ_0. Critic I: update density ratio and value function estimation via eq. (6): θ_{q,t}, θ_{ρ,t} → θ_{q,t+1}, θ_{ρ,t+1}. Critic II: update derivative of value function estimation via eq. (9): θ_{dq,t} → θ_{dq,t+1}. Critic III: update derivative of density ratio estimation via eq. (14): θ_{ψ,t} → θ_{ψ,t+1}. Actor: update policy parameter via eq. (15): w_{t+1} = w_t + α (1/N) Σ_i G^i_DR(w_t). end for. Output: w_{T̂} with T̂ chosen uniformly in {0, ..., T-1}.
Open Source Code	No	No statement regarding the release of open-source code for the described methodology or a link to a code repository was found.
Open Datasets	Yes	We consider a variant of Baird's counterexample (Baird, 1995; Sutton & Barto, 2018) as shown in Figure 1.
Dataset Splits	No	The paper describes the initial and behavior distributions but does not provide specific training, validation, or test dataset splits or percentages.
Hardware Specification	No	No specific hardware details (GPU/CPU models, memory, or specific computing environments) used for running experiments are mentioned in the paper.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers.
Experiment Setup	Yes	In our experiments, we consider fixed learning rates 0.1, 0.5, 0.1, 0.05, 0.01 for updating w, θ_q, θ_ψ, θ_dq, and θ_dρ, respectively, and we set the mini-batch size as N = 5.
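
The Pseudocode and Experiment Setup rows above can be read together as a loop skeleton. The following is a minimal Python sketch of the control flow of Algorithm 1 only, not the authors' implementation. The update rules in eqs. (6), (9), (14) and (15) are not reproduced on this page, so they appear as caller-supplied hooks. The names `run_dr_off_pac`, `update_critics`, `dr_gradient`, `sample_batch`, and `sample_init_batch` are hypothetical. Using the same size N for both mini-batches, and pairing their samples one to one in the actor step, are assumptions.

```python
import random

# Values reported in the Experiment Setup row of the table above.
LEARNING_RATES = {"w": 0.1, "theta_q": 0.5, "theta_psi": 0.1,
                  "theta_dq": 0.05, "theta_drho": 0.01}
BATCH_SIZE = 5  # mini-batch size N


def run_dr_off_pac(w0, critics0, num_iters, sample_batch, sample_init_batch,
                   update_critics, dr_gradient, seed=0):
    """Control flow of Algorithm 1 only; every callable is a hypothetical hook.

    update_critics stands in for eqs. (6), (9) and (14); dr_gradient stands in
    for the per-sample term G^i_DR(w_t) of eq. (15). Neither is defined here.
    """
    rng = random.Random(seed)
    w, critics = list(w0), critics0
    iterates = []
    for _ in range(num_iters):
        iterates.append(list(w))  # record w_t before it is updated
        # Assumption: both mini-batches contain N samples.
        batch = sample_batch(rng, BATCH_SIZE)            # B_t from D_d
        init_batch = sample_init_batch(rng, BATCH_SIZE)  # B_{t,0} from mu_0
        # Critics I-III: delegated to the caller-supplied hook.
        critics = update_critics(critics, w, batch, init_batch, LEARNING_RATES)
        # Actor: w_{t+1} = w_t + alpha * (1/N) * sum_i G^i_DR(w_t).
        # Assumption: sample i of B_t is paired with sample i of B_{t,0}.
        grads = [dr_gradient(w, critics, s, s0)
                 for s, s0 in zip(batch, init_batch)]
        avg = [sum(g[k] for g in grads) / len(grads) for k in range(len(w))]
        w = [wk + LEARNING_RATES["w"] * gk for wk, gk in zip(w, avg)]
    # Output w_{T_hat} with T_hat drawn uniformly from {0, ..., T-1}.
    return rng.choice(iterates)
```

Passing the update rules in as callables keeps the sketch honest about what this summary does and does not specify: only the loop order, the learning rates, the batch size, and the uniform choice of the output iterate come from the table.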