Supervised Off-Policy Ranking

Authors: Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li, Tie-Yan Liu

ICML 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on public datasets show that our method outperforms baseline methods in terms of rank correlation, regret value, and stability.
Researcher Affiliation Collaboration (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China; (3) Microsoft Research Asia, Beijing, China.
Pseudocode Yes Algorithm 1 Training procedure of SOPR-T
Open Source Code Yes Our code is publicly available on GitHub: https://github.com/SOPR-T/SOPR-T
Open Datasets Yes We evaluate SOPR-T and baseline OPE algorithms on the D4RL datasets (Fu et al., 2020), which are widely used in offline RL studies: https://github.com/rail-berkeley/d4rl (a loading sketch is given after this table).
Dataset Splits Yes We randomly select 30 policies to form a training policy set and another 10 policies to form a validation policy set. The remaining 10 policies form a test policy set.
Hardware Specification Yes Our experiments are run on an NVIDIA Tesla P100 GPU.
Software Dependencies No The paper mentions the Adam optimizer and builds on SAC and d3rlpy, but it does not specify version numbers for any software dependencies, making the setup not fully reproducible from a software perspective.
Experiment Setup Yes Table 1 lists the configurations of our model and training process: input linear projection layer ((dim_s+dim_a), 64); low-level encoder with n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1; high-level encoder with n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1; output linear projection layer (256, 1); optimizer Adam; learning rate 0.001; batch size |Ds| = 16k; number of clusters K = 256 (a model-construction sketch based on these settings follows the table).
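
The following is a minimal loading sketch, not taken from the paper or its repository; it assumes the public d4rl and gym packages are installed and uses a hypothetical task name (hopper-medium-v0) to illustrate how the D4RL transition data referenced above can be obtained.

```python
# Minimal sketch (assumption: d4rl and gym are installed; the task name is
# hypothetical). Shows how D4RL transition data can be loaded in general,
# not the authors' actual data pipeline.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("hopper-medium-v0")        # hypothetical choice of task
dataset = d4rl.qlearning_dataset(env)     # dict of numpy arrays

states = dataset["observations"]          # shape (N, dim_s)
actions = dataset["actions"]              # shape (N, dim_a)
print(states.shape, actions.shape)
```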
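
The sketch below instantiates the two Transformer encoders with the hyperparameters listed in Table 1, using standard PyTorch modules. The observation/action dimensions, the 64-to-256 dimension change between the low-level and high-level encoders, and the way the modules are combined are illustrative assumptions; they are not specified in the excerpt above.

```python
# Minimal sketch of the Table 1 configuration using standard PyTorch modules.
# Assumptions (not from the paper): dim_s/dim_a values, and how the low-level
# outputs are lifted from 64 to 256 dimensions before the high-level encoder.
import torch
import torch.nn as nn

dim_s, dim_a = 11, 3  # hypothetical Hopper dimensions

# Input linear projection layer ((dim_s + dim_a), 64)
input_proj = nn.Linear(dim_s + dim_a, 64)

# Low-level encoder: n_layers=2, n_head=2, dim_feedforward=128, dropout=0.1
low_layer = nn.TransformerEncoderLayer(d_model=64, nhead=2,
                                       dim_feedforward=128, dropout=0.1)
low_encoder = nn.TransformerEncoder(low_layer, num_layers=2)

# High-level encoder: n_layers=6, n_head=8, dim_feedforward=512, dropout=0.1
high_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                        dim_feedforward=512, dropout=0.1)
high_encoder = nn.TransformerEncoder(high_layer, num_layers=6)

# Output linear projection layer (256, 1): maps the encoded policy
# representation to a scalar ranking score.
output_proj = nn.Linear(256, 1)

# Optimizer Adam with learning rate 0.001 (batch size |Ds| = 16k and the
# K = 256 clusters concern the data pipeline and are omitted here).
params = (list(input_proj.parameters()) + list(low_encoder.parameters())
          + list(high_encoder.parameters()) + list(output_proj.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
```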