Variational Latent Branching Model for Off-Policy Evaluation

Authors: Qitong Gao, Ge Gao, Min Chi, Miroslav Pajic

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, in which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.
Researcher Affiliation | Academia | Duke University, USA. Emails: {qitong.gao, miroslav.pajic}@duke.edu. North Carolina State University, USA. Emails: {ggao5, mchi}@ncsu.edu
Pseudocode | Yes | Pseudo-code for training and evaluating the VLBM can be found in Appendix C.
Open Source Code | Yes | Code available at https://github.com/gaoqitong/vlbm.
Open Datasets | Yes | Specifically, it utilizes existing environments and training trajectories provided by D4RL and RL Unplugged, which are two benchmark suites for offline RL training, and additionally provides target policies for OPE methods to evaluate. (A data-loading sketch follows the table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test splits (as percentages, sample counts, or references to pre-defined splits) for the data used in the experiments.
Hardware Specification | Yes | Training of the proposed method and the baselines is carried out on Nvidia Quadro RTX 6000, NVIDIA RTX A5000, and NVIDIA TITAN XP GPUs.
Software Dependencies | No | The paper mentions the 'Adam optimizer' but does not provide version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | Moreover, we consider the decoder to have B = 10 branches, i.e., {p_φ1, …, p_φ10}. The dimension of the latent space is set to 16, i.e., z ∈ Z ⊂ R^16. Other implementation details can be found in Appendix A. ... Specifically, for each task we train an ensemble of 10 AR models, for fair comparison against the VLBM, which leverages the branching architecture; see Appendix A for details of the AR ensemble setup. ... max_iter in Alg. 1 is set to 1,000 and the minibatch size to 64. The Adam optimizer is used to perform gradient descent. To determine the learning rate, we perform a grid search over {0.003, 0.001, 0.0007, 0.0005, 0.0003, 0.0001, 0.00005}. Exponential decay is applied to the learning rate, multiplying it by 0.997 every iteration. To train the VLBM, we set the constants from equation 10 following C1 = C2, determined by grid search over {5, 1, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0001}. To train VLM+RSA, the constant C from equation 8 is determined by grid search over the same set of values. L2 regularization with decay of 0.001 and batch normalization are applied to all hidden layers.
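For the Open Datasets row above, the following is a minimal sketch, assuming the standard D4RL Python API, of how one of the referenced offline datasets is typically loaded. The environment name is illustrative and not necessarily one of the DOPE tasks used in the paper.

```python
# Minimal sketch of loading an offline dataset via D4RL (environment name is
# illustrative; the paper's tasks come from the DOPE benchmark).
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the offline envs with gym

env = gym.make("halfcheetah-medium-v2")
dataset = env.get_dataset()  # dict of numpy arrays keyed by field name
print(dataset["observations"].shape, dataset["actions"].shape,
      dataset["rewards"].shape, dataset["terminals"].shape)
```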
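To make the Experiment Setup row concrete, below is a minimal PyTorch sketch of the quoted optimization settings: Adam with a grid search over learning rates, exponential decay of 0.997 per iteration, 1,000 iterations with minibatches of 64, and an L2 penalty approximated here by Adam's weight_decay (the paper applies L2 regularization to hidden layers only). `VLBM`, its `loss`, and `sample_minibatch` are hypothetical placeholders, not the authors' code.

```python
import torch

# Grid-search candidates for the learning rate, as quoted above.
LEARNING_RATES = [3e-3, 1e-3, 7e-4, 5e-4, 3e-4, 1e-4, 5e-5]
MAX_ITER, BATCH_SIZE = 1_000, 64

for lr in LEARNING_RATES:                               # grid search over learning rates
    model = VLBM(latent_dim=16, num_branches=10)        # hypothetical constructor
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=1e-3)     # stand-in for the L2 penalty
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.997)

    for _ in range(MAX_ITER):
        batch = sample_minibatch(BATCH_SIZE)            # hypothetical data loader
        loss = model.loss(batch)                        # variational objective (equation 10)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                                # multiply LR by 0.997 each iteration
```

In practice each learning-rate candidate would be trained to completion and the best model kept according to a validation criterion; the outer loop above only indicates where that selection happens.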