Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective

Authors: Zeyu Zhang, Yi Su, Hui Yuan, Yiran Wu, Rishab Balasubramanian, Qingyun Wu, Huazheng Wang, Mengdi Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the performance of our proposed method CUOLR on several public datasets, compared with the state-of-the-art off-policy learning-to-rank methods.
Researcher Affiliation | Collaboration | Zeyu Zhang¹; affiliations: ¹University of Science and Technology of China, ²Google DeepMind, ³Princeton University, ⁴Penn State University, ⁵Oregon State University
Pseudocode | Yes | Algorithm 1: Click Model-Agnostic Unified Off-policy Learning to Rank (with CQL)
1: Inputs: logged ranking data {(q_i, R_i, c_i(q_i, R_i))}_{i=1}^n, length of the ranking K, batch size B, training iterations T.
2: Initialize: policy π_ξ, Q-function Q_θ, embedding model φ_ψ(·, ·).
3: for t ∈ [T] do
4:   Randomly sample a batch of queries Q of size B.
5:   Construct offline RL episodes T = {(s_k^i, a_k^i, r_k^i)}_{k=1}^K with s_k^i = φ_ψ(R_i[:k], k), a_k^i := R_i[k], r_k^i := c_i(q_i, R_i)[k] for k ∈ [K].
6:   Train the Q-net (and embedding model) with the loss defined in Equation (5): θ ← θ − η_Q ∇_θ Loss(θ, T), ψ ← ψ − η_φ ∇_ψ Loss(θ, T).
7:   Improve the policy π_ξ (and embedding model) with SAC-style entropy regularization: ξ ← ξ + η_π ∇_ξ E_{s∼T, a∼π_ξ(·|s)}[Q_θ(s, a) − log π_ξ(a | s)], ψ ← ψ + η_φ ∇_ψ E_{s∼T, a∼π_ξ(·|s)}[Q_θ(s, a) − log π_ξ(a | s)].
8: end for
9: Output: learned ranking policy π_ξ*, Q-function Q_θ*, embedding model φ_ψ*(·, ·).
10: Recover the optimal ranking from the learned policy π_ξ* using Definition 2.
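To make lines 5-7 of the algorithm concrete, here is a minimal PyTorch sketch of one training iteration, under several assumptions: the embedding, Q-network, and policy modules (embed, q_net, policy) are hypothetical stand-ins, candidate documents are indexed 0..K-1 in logged-ranking order, and the conservative term is the generic CQL penalty rather than the paper's exact Loss(θ, T) from Equation (5).

```python
# A minimal, assumption-laden sketch of one iteration of Algorithm 1 (CQL variant).
import torch
import torch.nn.functional as F

def cuolr_cql_step(batch, embed, q_net, policy, q_opt, pi_opt, alpha=0.1, gamma=1.0):
    """One update on a batch of logged episodes {(s_k, a_k, r_k)}_{k=1..K}.

    batch["doc_feats"]: (B, K, d) features of the K documents in each logged ranking R_i
    batch["clicks"]:    (B, K) click rewards c_i(q_i, R_i)[k]
    """
    # Line 5: states s_k = phi_psi(R_i[:k], k); the prefix encoder is abstracted into `embed`.
    states = embed(batch["doc_feats"])                      # (B, K, h)
    q_all = q_net(states)                                   # (B, K, K): Q(s_k, a) over candidate docs
    logged_a = torch.arange(q_all.shape[-1], device=q_all.device)
    logged_a = logged_a.unsqueeze(0).expand(q_all.shape[0], -1)          # logged action at position k is doc k
    q_logged = q_all.gather(-1, logged_a.unsqueeze(-1)).squeeze(-1)      # Q(s_k, a_k)

    # One-step Bellman target: r_k + gamma * max_a Q(s_{k+1}, a); position K is terminal.
    with torch.no_grad():
        target = batch["clicks"].float().clone()
        target[:, :-1] += gamma * q_all[:, 1:].max(dim=-1).values

    bellman = F.mse_loss(q_logged, target)
    # Generic CQL regularizer: push Q down on all actions, up on the logged actions.
    conservative = (torch.logsumexp(q_all, dim=-1) - q_logged).mean()
    q_opt.zero_grad()
    (bellman + alpha * conservative).backward()
    q_opt.step()

    # Line 7: SAC-style improvement, maximizing E_{a~pi}[Q(s, a) - log pi(a | s)].
    states = embed(batch["doc_feats"])
    log_pi = policy(states)                                 # (B, K, K) log-probabilities (assumed)
    pi_obj = (log_pi.exp() * (q_net(states).detach() - log_pi)).sum(-1).mean()
    pi_opt.zero_grad()
    (-pi_obj).backward()
    pi_opt.step()
```

Since lines 6 and 7 both update ψ, the embedding parameters can be registered in both optimizers so that either loss adjusts them.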
Open Source Code | Yes | Codes: https://github.com/ZeyuZhang1901/Unified-Off-Policy-LTR-Neurips2023
Open Datasets | Yes | We conduct semi-synthetic experiments on two traditional learning-to-rank benchmark datasets: MSLR-WEB10K and Yahoo! LETOR (set 1).
Dataset Splits | Yes | Both datasets come with a train-val-test split. The train data is used for generating the logging policy and simulating clicks, the validation data is used for hyperparameter selection, and the final performance of the learned ranking policy is evaluated on the test data.
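As an illustration of how these splits could be consumed, the sketch below loads one MSLR-WEB10K fold in its standard SVMLight/LETOR format and marks where the logging policy, click simulation, hyperparameter selection, and final evaluation would plug in; the commented helper names (train_logging_policy, simulate_clicks) are hypothetical.

```python
# A minimal sketch of loading an MSLR-WEB10K fold and routing the three splits
# as described in the paper's setup; helper functions in comments are hypothetical.
from sklearn.datasets import load_svmlight_file

def load_fold(fold_dir):
    splits = {}
    for name in ("train", "vali", "test"):
        X, y, qid = load_svmlight_file(f"{fold_dir}/{name}.txt", query_id=True)
        splits[name] = (X, y, qid)
    return splits

splits = load_fold("MSLR-WEB10K/Fold1")
# Train split: fit a logging policy, rank its documents, and simulate clicks on them.
#   logging_policy = train_logging_policy(*splits["train"])
#   logged_data    = simulate_clicks(logging_policy, *splits["train"])
# Validation split: select hyperparameters (learning rate, alpha, ...) for the learner.
# Test split: report the final NDCG@K of the ranking policy recovered from pi_xi.
```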
Hardware Specification | No | The paper mentions 'Google Cloud Research Credits Program' in the acknowledgments but does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions the 'Adam optimizer' and the 'RankLib learning-to-rank library' but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | For the embedding model in our method, we use multi-head attention with 8 heads. For the actors and critics in the CQL and SAC algorithms, we use a 2-layer MLP with width 256 and ReLU activation. The conservative parameter α (marked red in Equation (5)) in CQL is set to 0.1. We use Adam for all methods, with the learning rate tuned on the validation set.
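A minimal PyTorch sketch of the reported components follows: an 8-head multi-head-attention embedding over the ranked prefix, and 2-layer MLPs (width 256, ReLU) for the actor and critic. The embedding dimension (128), the "two hidden layers" reading of "2-layer MLP", and the example learning rate are assumptions; only the head count, MLP width, activation, optimizer, and α = 0.1 come from the quoted setup.

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """phi_psi(R[:k], k): self-attention over the documents already placed (assumed design)."""
    def __init__(self, doc_dim, embed_dim=128, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(doc_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, prefix_feats):          # (B, k, doc_dim)
        x = self.proj(prefix_feats)
        out, _ = self.attn(x, x, x)           # self-attention over the prefix
        return out.mean(dim=1)                # (B, embed_dim) state representation

def two_layer_mlp(in_dim, out_dim, width=256):
    """Architecture shared by the CQL/SAC actor and critic heads (width 256, ReLU)."""
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, out_dim))

doc_dim, embed_dim, num_candidates = 136, 128, 10    # 136 = MSLR-WEB10K feature count
embedding = PrefixEmbedding(doc_dim, embed_dim)
critic = two_layer_mlp(embed_dim, num_candidates)    # Q_theta(s, .)
actor = two_layer_mlp(embed_dim, num_candidates)     # logits of pi_xi(. | s)
cql_alpha = 0.1                                      # conservative coefficient in Eq. (5)
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(critic.parameters()) + list(actor.parameters()),
    lr=3e-4)                                         # example value; tuned on the validation set
```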