Supervised Off-Policy Ranking
Authors: Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li, Tie-Yan Liu
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on public datasets show that our method outperforms baseline methods in terms of rank correlation, regret value, and stability. |
| Researcher Affiliation | Collaboration | Department of Electronic Engineering, Tsinghua University, Beijing, China; Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China; Microsoft Research Asia, Beijing, China. |
| Pseudocode | Yes | Algorithm 1 Training procedure of SOPR-T |
| Open Source Code | Yes | Our code is publicly available at GitHub: https://github.com/SOPR-T/SOPR-T |
| Open Datasets | Yes | We evaluate SOPR-T and baseline OPE algorithms on the D4RL datasets (Fu et al., 2020), which are widely used in offline RL studies: https://github.com/rail-berkeley/d4rl |
| Dataset Splits | Yes | We randomly select 30 policies to form training policy set and another 10 policies to form validation policy set. The remaining 10 policies are used to form a test policy set. |
| Hardware Specification | Yes | Our experiments are run with a Nvidia Tesla P100 GPU. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam' and uses libraries like SAC and d3rlpy, but it does not specify version numbers for any software dependencies, making the setup not fully reproducible from a software perspective. |
| Experiment Setup | Yes | Table 1 lists the configurations of our model and training process. Input linear projection layer: (dim_s + dim_a, 64); Low-level encoder: n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1; High-level encoder: n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1; Output linear projection layer: (256, 1); Optimizer: Adam; Learning rate: 0.001; Batch size: \|Ds\| = 16k; Number of clusters: K = 256. |
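The dataset-split row above describes a random 30/10/10 partition of candidate policies into training, validation, and test sets. The paper's released code presumably implements this itself; the following is only a minimal sketch of that kind of split, with the function name `split_policies` and the fixed seed being illustrative assumptions, not part of the paper.

```python
import random

def split_policies(policy_ids, n_train=30, n_val=10, n_test=10, seed=0):
    """Randomly partition candidate policies into disjoint train/val/test
    sets, mirroring the 30/10/10 split described in the paper."""
    assert len(policy_ids) >= n_train + n_val + n_test
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = policy_ids[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Example: 50 candidate policies, as in the 30 + 10 + 10 setup.
train, val, test = split_policies(list(range(50)))
print(len(train), len(val), len(test))  # 30 10 10
```

Fixing the random seed (as sketched here) is what makes such a split reproducible across runs; the paper does not state which seed, if any, was used.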
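The paper evaluates OPE methods by rank correlation and regret between predicted policy scores and true policy returns. As a rough illustration of those two metrics, here is a tie-free Spearman rank correlation and a simple regret@1; the exact definitions and any normalization used in the paper may differ, and both function names are hypothetical.

```python
def rank_correlation(true_returns, scores):
    """Spearman rank correlation between true policy returns and predicted
    scores, using the closed-form 1 - 6*sum(d^2)/(n*(n^2-1)) for tie-free data."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    n = len(true_returns)
    rt, rs = ranks(true_returns), ranks(scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rt, rs))
    return 1 - 6 * d2 / (n * (n * n - 1))

def regret_at_1(true_returns, scores):
    """Gap between the best policy's true return and the true return of the
    policy the scores rank first (unnormalized)."""
    best = max(true_returns)
    picked = true_returns[max(range(len(scores)), key=lambda i: scores[i])]
    return best - picked
```

A perfect ranking gives correlation 1.0 and regret 0.0; for production use, `scipy.stats.spearmanr` handles ties properly, which this sketch does not.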