Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective
Authors: Zeyu Zhang, Yi Su, Hui Yuan, Yiran Wu, Rishab Balasubramanian, Qingyun Wu, Huazheng Wang, Mengdi Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate the performance of our proposed method CUOLR on several public datasets, compared with the state-of-the-art off-policy learning-to-rank methods. |
| Researcher Affiliation | Collaboration | Zeyu Zhang (1); affiliations: (1) University of Science and Technology of China, (2) Google DeepMind, (3) Princeton University, (4) Penn State University, (5) Oregon State University |
| Pseudocode | Yes | Algorithm 1 Click Model-Agnostic Unified Off-policy Learning to Rank (with CQL). 1: Inputs: logged ranking data {(q_i, R_i, c_i(q_i, R_i))}_{i=1}^{n}, length of the ranking K, batch size B, training iterations T. 2: Initialize: policy π_ξ, Q-function Q_θ, embedding model φ_ψ(·, ·). 3: for t ∈ [T] do: 4: Randomly sample a batch of queries Q of size B. 5: Construct offline RL episodes T = {(s_k^i, a_k^i, r_k^i)}_{k=1}^{K} with s_k^i = φ_ψ(R_i[:k], k), a_k^i := R_i[k], r_k^i := c_i(q_i, R_i)[k] for k ∈ [K]. 6: Train the Q-net (and embedding model) with the loss defined in Equation (5): θ ← θ − η_Q ∇_θ Loss(θ, T); ψ ← ψ − η_φ ∇_ψ Loss(θ, T). 7: Improve the policy π_ξ (and embedding model) with SAC-style entropy regularization: ξ ← ξ + η_π ∇_ξ E_{s∼T, a∼π_ξ(·|s)}[Q_θ(s, a) − log π_ξ(a | s)]; ψ ← ψ + η_φ ∇_ψ E_{s∼T, a∼π_ξ(·|s)}[Q_θ(s, a) − log π_ξ(a | s)]. 8: end for. 9: Output: learned ranking policy π_ξ*, Q-function Q_θ*, embedding model φ_ψ*(·, ·). 10: Recover the optimal ranking from the learned policy π_ξ* using Definition 2. (A minimal code sketch of this loop follows the table.) |
| Open Source Code | Yes | Codes: https://github.com/ZeyuZhang1901/Unified-Off-Policy-LTR-Neurips2023 |
| Open Datasets | Yes | We conduct semi-synthetic experiments on two traditional learning-to-rank benchmark datasets: MSLR-WEB10K and Yahoo! LETOR (set 1). |
| Dataset Splits | Yes | Both datasets come with a train-val-test split. The training data is used to generate the logging policy and simulate clicks, the validation data is used for hyperparameter selection, and the final performance of the learned ranking policy is evaluated on the test data. |
| Hardware Specification | No | The paper mentions 'Google Cloud Research Credits Program' in the acknowledgments but does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions the 'Adam optimizer' and the 'RankLib learning-to-rank library' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For the embedding model in our method, we use multi-head attention with 8 heads. For the actors and critics in the CQL and SAC algorithms, we use a 2-layer MLP with width 256 and ReLU activation. The conservative parameter α (marked red in Equation (5)) in CQL is set to 0.1. We use Adam for all methods, with the learning rate tuned on the validation set. (A hypothetical instantiation of these networks is sketched below.) |
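The pseudocode quoted in the Pseudocode row maps onto a standard offline RL training loop. Below is a minimal PyTorch sketch of the episode construction (step 5), a conservative Q-learning critic update (step 6), and a SAC-style policy improvement (step 7). The function names, the `cand_actions` candidate-document tensor, the max-backup target, and the simplified conservatism term are illustrative assumptions standing in for the paper's Equation (5), not the authors' released implementation.

```python
# Minimal sketch of Algorithm 1's inner loop. Function names, tensor layouts, and
# the simplified CQL loss are illustrative assumptions, not the released code.
import torch
import torch.nn.functional as F


def construct_episodes(doc_feats, clicks, embed):
    """Turn one logged ranking into (state, action, reward) triples (step 5).

    doc_feats: (K, d) features of the K ranked documents R_i.
    clicks:    (K,) click vector c_i(q_i, R_i), cast to float rewards.
    embed:     callable phi_psi(prefix, k) -> state vector of shape (d,).
    """
    K = doc_feats.shape[0]
    states, actions, rewards = [], [], []
    for k in range(K):
        states.append(embed(doc_feats[:k], k))   # s_k = phi_psi(R_i[:k], k)
        actions.append(doc_feats[k])             # a_k := R_i[k]
        rewards.append(clicks[k].float())        # r_k := c_i(q_i, R_i)[k]
    return torch.stack(states), torch.stack(actions), torch.stack(rewards)


def cql_critic_loss(q_net, states, actions, rewards, next_states, cand_actions,
                    alpha=0.1, gamma=1.0):
    """Bellman error plus a conservative penalty (stand-in for Equation (5), step 6).

    cand_actions: (N, A, d) candidate documents per state, used both for the
    bootstrapped target and for the log-sum-exp conservatism term.
    """
    q_sa = q_net(states, actions)                                        # (N,)
    q_cand = torch.stack([q_net(states, cand_actions[:, j])
                          for j in range(cand_actions.shape[1])], dim=1)  # (N, A)
    with torch.no_grad():
        next_q = torch.stack([q_net(next_states, cand_actions[:, j])
                              for j in range(cand_actions.shape[1])], dim=1)
        target = rewards + gamma * next_q.max(dim=1).values
    bellman = F.mse_loss(q_sa, target)
    # Push down Q on unseen (out-of-distribution) actions, push up Q on logged ones.
    conservative = (torch.logsumexp(q_cand, dim=1) - q_sa).mean()
    return bellman + alpha * conservative


def sac_policy_loss(q_net, policy, states, cand_actions):
    """SAC-style improvement (step 7): maximize E_{a~pi}[Q(s, a) - log pi(a|s)]."""
    logits = torch.stack([policy(states, cand_actions[:, j])
                          for j in range(cand_actions.shape[1])], dim=1)  # (N, A)
    log_pi = F.log_softmax(logits, dim=1)
    q_cand = torch.stack([q_net(states, cand_actions[:, j])
                          for j in range(cand_actions.shape[1])], dim=1).detach()
    # Negative sign because the optimizer minimizes; the objective is maximized.
    return -(log_pi.exp() * (q_cand - log_pi)).sum(dim=1).mean()
```

In this sketch, `q_net(states, actions)` and `policy(states, actions)` are assumed to score a batch of (state, document) pairs; how the paper enumerates candidate actions per position is not reproduced here.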
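The network sizes quoted in the Experiment Setup row can be instantiated as in the following hypothetical PyTorch sketch. Only the 8 attention heads, the 2-layer width-256 ReLU MLPs, the CQL α = 0.1, and the use of Adam come from the quoted text; the feature dimension (136, as in MSLR-WEB10K), the rank-position query, and the learning rate are placeholder assumptions.

```python
# Hypothetical instantiation of the quoted setup. Only the 8 attention heads, the
# 2-layer width-256 ReLU MLPs, alpha = 0.1, and Adam come from the paper; the
# dimensions, position encoding, and learning rate are placeholders.
import torch
import torch.nn as nn

DOC_DIM = 136          # MSLR-WEB10K feature size; placeholder choice
CQL_ALPHA = 0.1        # conservative parameter alpha from the quoted setup


class PrefixEmbedding(nn.Module):
    """Multi-head attention (8 heads) over the already-ranked prefix."""

    def __init__(self, d=DOC_DIM, heads=8, max_rank=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=heads, batch_first=True)
        self.pos = nn.Embedding(max_rank, d)   # rank-position query; an assumption

    def forward(self, prefix, k):
        # prefix: (B, k, d) documents ranked so far; k: current position index.
        query = self.pos(torch.full((prefix.shape[0], 1), k, dtype=torch.long))
        if prefix.shape[1] == 0:               # first position: no prefix to attend to
            return query.squeeze(1)
        out, _ = self.attn(query, prefix, prefix)
        return out.squeeze(1)                  # state s_k, shape (B, d)


def mlp(in_dim, out_dim, width=256):
    """2-layer MLP (one hidden layer of width 256) with ReLU, per the setup."""
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, out_dim))


embed = PrefixEmbedding()
critic = mlp(DOC_DIM + DOC_DIM, 1)   # Q_theta(s, a): state and document concatenated
actor = mlp(DOC_DIM + DOC_DIM, 1)    # scores a candidate document given the state
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(critic.parameters()) + list(actor.parameters()),
    lr=1e-3,                         # the paper tunes the learning rate on validation
)
```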