Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

Authors: Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct empirical experiments to answer the following two questions: (a) is the overall convergence of DR-Off-PAC doubly robust to function approximation errors, as Theorems 2 & 3 indicate? (b) how does DR-Off-PAC compare with other off-policy methods?"
Researcher Affiliation | Academia | 1 Department of Electrical and Computer Engineering, The Ohio State University; 2 Departments of Industrial Engineering & Management Sciences, Northwestern University; 3 Department of Operations Research and Financial Engineering, Princeton University.
Pseudocode | Yes |
Algorithm 1 DR-Off-PAC
Initialize: policy parameter w_0 and estimator parameters θ_{q,0}, θ_{ρ,0}, θ_{dq,0}, and θ_{ψ,0}.
for t = 0, ..., T−1 do
    Obtain mini-batch samples B_t ~ D_d and B_{t,0} ~ μ_0
    Critic I: update density ratio and value function estimates via eq. (6): (θ_{q,t}, θ_{ρ,t}) → (θ_{q,t+1}, θ_{ρ,t+1})
    Critic II: update derivative of value function estimate via eq. (9): θ_{dq,t} → θ_{dq,t+1}
    Critic III: update derivative of density ratio estimate via eq. (14): θ_{ψ,t} → θ_{ψ,t+1}
    Actor: update policy parameter via eq. (15): w_{t+1} = w_t + α · (1/N) Σ_i G_i^DR(w_t)
end for
Output: w_{T̂} with T̂ chosen uniformly from {0, ..., T−1}
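The loop structure of Algorithm 1 can be sketched in code. This is a minimal illustration only: the critic updates and the doubly robust gradient G_i^DR are placeholder linear corrections on synthetic data (the paper's eqs. (6), (9), (14), and (15) define the actual updates), and all function and variable names here are our own.

```python
import numpy as np

def dr_off_pac_sketch(dataset, T=100, alpha=0.1, batch_size=5, dim=4, seed=0):
    """Skeleton of the DR-Off-PAC loop; updates are illustrative stand-ins."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)          # policy parameter w_0
    theta_q = np.zeros(dim)    # value-function critic
    theta_rho = np.zeros(dim)  # density-ratio critic
    theta_dq = np.zeros(dim)   # derivative-of-value critic
    theta_psi = np.zeros(dim)  # derivative-of-ratio critic
    iterates = []
    for t in range(T):
        batch = dataset[rng.choice(len(dataset), size=batch_size)]
        # Critic I: density ratio and value function (stand-in for eq. (6))
        theta_q += 0.5 * (batch.mean(axis=0) - theta_q)
        theta_rho += 0.1 * (batch.mean(axis=0) - theta_rho)
        # Critic II: derivative of value function (stand-in for eq. (9))
        theta_dq += 0.05 * (theta_q - theta_dq)
        # Critic III: derivative of density ratio (stand-in for eq. (14))
        theta_psi += 0.1 * (theta_rho - theta_psi)
        # Actor: ascend a placeholder doubly robust gradient estimate,
        # standing in for (1/N) * sum_i G_i^DR(w_t) in eq. (15)
        g_dr = theta_dq + theta_psi - w
        w = w + alpha * g_dr
        iterates.append(w.copy())
    # Output: w_{T_hat} with T_hat drawn uniformly from {0, ..., T-1}
    return iterates[rng.integers(T)]

data = np.random.default_rng(1).normal(size=(200, 4))
w_out = dr_off_pac_sketch(data)
```

The key structural point the sketch preserves is that all three critics and the actor are updated within the same iteration (single-timescale updates), and the returned iterate is sampled uniformly over the run rather than taken from the final step.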
Open Source Code | No | No statement regarding the release of open-source code for the described methodology or a link to a code repository was found.
Open Datasets | Yes | "We consider a variant of Baird's counterexample (Baird, 1995; Sutton & Barto, 2018) as shown in Figure 1."
Dataset Splits | No | The paper describes the initial and behavior distributions but does not provide specific training, validation, or test dataset splits or percentages.
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or specific computing environments) used for running experiments are mentioned in the paper.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | "In our experiments, we consider fixed learning rates 0.1, 0.5, 0.1, 0.05, 0.01 for updating w, θq, θψ, θdq, and θdρ, respectively, and we set the mini-batch size as N = 5."
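For quick reference, the hyperparameters reported in the experiment setup can be collected into a small config sketch; the dictionary keys and constant names below are our own labels, not identifiers from the paper.

```python
# Learning rates reported in the paper's experiment setup, keyed by the
# parameter they update (labels are illustrative, not from the paper).
LEARNING_RATES = {
    "w": 0.1,           # actor / policy parameter
    "theta_q": 0.5,     # value-function critic
    "theta_psi": 0.1,   # derivative of density ratio
    "theta_dq": 0.05,   # derivative of value function
    "theta_drho": 0.01, # remaining density-ratio estimator
}
BATCH_SIZE = 5  # mini-batch size N = 5
```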