Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
Authors: Tengyu Xu, Zhuoran Yang, Zhaoran Wang, Yingbin Liang
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical experiments to answer the following two questions: (a) is the overall convergence of DR-Off-PAC doubly robust to function approximation errors, as Theorems 2 and 3 indicate? (b) how does DR-Off-PAC compare with other off-policy methods? |
| Researcher Affiliation | Academia | ¹ Department of Electrical and Computer Engineering, The Ohio State University; ² Departments of Industrial Engineering & Management Sciences, Northwestern University; ³ Department of Operations Research and Financial Engineering, Princeton University. |
| Pseudocode | Yes | Algorithm 1: DR-Off-PAC. Initialize: policy parameter $w_0$ and estimator parameters $\theta_{q,0}$, $\theta_{\rho,0}$, $\theta_{d_q,0}$, and $\theta_{\psi,0}$. For $t = 0, \ldots, T-1$ do: obtain mini-batch samples $\mathcal{B}_t \sim \mathcal{D}_d$ and $\mathcal{B}_{t,0} \sim \mu_0$. Critic I: update density ratio and value function estimation via eq. (6): $\theta_{q,t}, \theta_{\rho,t} \to \theta_{q,t+1}, \theta_{\rho,t+1}$. Critic II: update derivative of value function estimation via eq. (9): $\theta_{d_q,t} \to \theta_{d_q,t+1}$. Critic III: update derivative of density ratio estimation via eq. (14): $\theta_{\psi,t} \to \theta_{\psi,t+1}$. Actor: update policy parameter via eq. (15): $w_{t+1} = w_t + \alpha \frac{1}{N} \sum_i G^i_{\mathrm{DR}}(w_t)$. End for. Output: $w_{\hat{T}}$ with $\hat{T}$ chosen uniformly in $\{0, \ldots, T-1\}$. |
| Open Source Code | No | No statement regarding the release of open-source code for the described methodology or a link to a code repository was found. |
| Open Datasets | Yes | We consider a variant of Baird's counterexample (Baird, 1995; Sutton & Barto, 2018) as shown in Figure 1. |
| Dataset Splits | No | The paper describes the initial and behavior distributions but does not provide specific training, validation, or test dataset splits or percentages. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or specific computing environments) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In our experiments, we consider fixed learning rates 0.1, 0.5, 0.1, 0.05, 0.01 for updating $w$, $\theta_q$, $\theta_\psi$, $\theta_{d_q}$, and $\theta_{d_\rho}$, respectively, and we set the mini-batch size as N = 5. |
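
The actor step and output rule quoted in the Pseudocode row can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the actor update averages N per-sample gradient estimates, and it takes the policy learning rate 0.1 and mini-batch size N = 5 from the Experiment Setup row. The names `actor_step`, `run_actor_loop`, and `grad_estimator` are hypothetical, and `grad_estimator` is a placeholder for the doubly robust estimator of eq. (15), which is not reproduced here. The three critic updates (eqs. (6), (9), (14)) are omitted entirely.

```python
import numpy as np


def actor_step(w, per_sample_grads, alpha=0.1):
    """One actor update: w + alpha * (mean of the per-sample gradient estimates)."""
    grads = np.asarray(per_sample_grads, dtype=float)
    return w + alpha * grads.mean(axis=0)


def run_actor_loop(w0, grad_estimator, T, N=5, alpha=0.1, seed=0):
    """Run T actor updates and return an iterate chosen uniformly from w_0..w_{T-1}.

    grad_estimator(w, rng) is a hypothetical stand-in that returns one
    per-sample gradient estimate; the paper's estimator is not implemented.
    """
    rng = np.random.default_rng(seed)
    iterates = [np.asarray(w0, dtype=float)]
    for _ in range(T):
        w = iterates[-1]
        # Draw N per-sample estimates, standing in for one mini-batch.
        grads = [grad_estimator(w, rng) for _ in range(N)]
        iterates.append(actor_step(w, grads, alpha))
    t_hat = int(rng.integers(0, T))  # uniform over {0, ..., T-1}
    return iterates[t_hat], iterates


if __name__ == "__main__":
    # Toy estimator (assumption): noisy gradient of -||w - 1||^2 / 2.
    toy = lambda w, rng: (1.0 - w) + 0.01 * rng.standard_normal(w.shape)
    w_out, history = run_actor_loop(np.zeros(2), toy, T=50)
    print(w_out, history[-1])
```

Returning a uniformly sampled iterate rather than the last one mirrors the output rule in the quoted algorithm; the toy estimator in the usage example only exercises the loop and says nothing about the method's behavior on Baird's counterexample.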