Supervised Off-Policy Ranking

Authors: Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li, Tie-Yan Liu

ICML 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on public datasets show that our method outperforms baseline methods in terms of rank correlation, regret value, and stability.
Researcher Affiliation Collaboration (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China; (3) Microsoft Research Asia, Beijing, China.
Pseudocode Yes Algorithm 1 Training procedure of SOPR-T
Open Source Code Yes Our code is publicly available on GitHub: https://github.com/SOPR-T/SOPR-T
Open Datasets Yes We evaluate SOPR-T and baseline OPE algorithms on the D4RL datasets (Fu et al., 2020), which are widely used in offline RL studies: https://github.com/rail-berkeley/d4rl (a loading sketch is given after this table).
Dataset Splits Yes We randomly select 30 policies to form a training policy set and another 10 policies to form a validation policy set. The remaining 10 policies form a test policy set.
Hardware Specification Yes Our experiments are run on an NVIDIA Tesla P100 GPU.
Software Dependencies No The paper mentions the Adam optimizer and builds on SAC and d3rlpy, but it does not specify version numbers for any software dependencies, making the setup not fully reproducible from a software perspective.
Experiment Setup Yes Table 1 lists the configurations of our model and training process: input linear projection layer ((dim_s+dim_a), 64); low-level encoder with n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1; high-level encoder with n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1; output linear projection layer (256, 1); optimizer Adam; learning rate 0.001; batch size |Ds| = 16k; number of clusters K = 256 (a model-construction sketch based on these settings follows the table).
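
The following is a minimal loading sketch, not taken from the paper or its repository; it assumes the public d4rl and gym packages are installed and uses a hypothetical task name (hopper-medium-v0) to illustrate how the D4RL transition data referenced above can be obtained.

```python
# Minimal sketch (assumption: d4rl and gym are installed; the task name is
# hypothetical). Shows how D4RL transition data can be loaded in general,
# not the authors' actual data pipeline.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("hopper-medium-v0")        # hypothetical choice of task
dataset = d4rl.qlearning_dataset(env)     # dict of numpy arrays

states = dataset["observations"]          # shape (N, dim_s)
actions = dataset["actions"]              # shape (N, dim_a)
print(states.shape, actions.shape)
```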
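
The sketch below instantiates the two Transformer encoders with the hyperparameters listed in Table 1, using standard PyTorch modules. The observation/action dimensions, the 64-to-256 dimension change between the low-level and high-level encoders, and the way the modules are combined are illustrative assumptions; they are not specified in the excerpt above.

```python
# Minimal sketch of the Table 1 configuration using standard PyTorch modules.
# Assumptions (not from the paper): dim_s/dim_a values, and how the low-level
# outputs are lifted from 64 to 256 dimensions before the high-level encoder.
import torch
import torch.nn as nn

dim_s, dim_a = 11, 3  # hypothetical Hopper dimensions

# Input linear projection layer ((dim_s + dim_a), 64)
input_proj = nn.Linear(dim_s + dim_a, 64)

# Low-level encoder: n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1
low_layer = nn.TransformerEncoderLayer(d_model=64, nhead=2,
                                       dim_feedforward=128, dropout=0.1)
low_encoder = nn.TransformerEncoder(low_layer, num_layers=2)

# High-level encoder: n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1
high_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                        dim_feedforward=512, dropout=0.1)
high_encoder = nn.TransformerEncoder(high_layer, num_layers=6)

# Output linear projection layer (256, 1): maps the encoded policy
# representation to a scalar ranking score.
output_proj = nn.Linear(256, 1)

# Optimizer Adam with learning rate 0.001 (batch size |Ds| = 16k and the
# K = 256 clusters concern the data pipeline and are omitted here).
params = (list(input_proj.parameters()) + list(low_encoder.parameters())
          + list(high_encoder.parameters()) + list(output_proj.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
```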