Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective

Authors: Zeyu Zhang, Yi Su, Hui Yuan, Yiran Wu, Rishab Balasubramanian, Qingyun Wu, Huazheng Wang, Mengdi Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the performance of our proposed method CUOLR on several public datasets, compared with the state-of-the-art off-policy learning-to-rank methods.
Researcher Affiliation | Collaboration | Zeyu Zhang¹; affiliations: ¹University of Science and Technology of China, ²Google DeepMind, ³Princeton University, ⁴Penn State University, ⁵Oregon State University
Pseudocode | Yes | Algorithm 1: Click Model-Agnostic Unified Off-policy Learning to Rank (with CQL)
1: Inputs: logged ranking data {(q_i, R_i, c_i(q_i, R_i))}_{i=1}^n, length of the ranking K, batch size B, training iterations T.
2: Initialize: policy π_ξ, Q-function Q_θ, embedding model φ_ψ(·, ·).
3: for t ∈ [T] do
4:   Randomly sample a batch of queries Q of size B.
5:   Construct offline RL episodes T = {(s_k^i, a_k^i, r_k^i)}_{k=1}^K with s_k^i = φ_ψ(R_i[:k], k), a_k^i := R_i[k], r_k^i := c_i(q_i, R_i)[k] for k ∈ [K].
6:   Train the Q-net (and embedding model) with the loss defined in Equation (5): θ ← θ − η_Q ∇_θ Loss(θ, T), ψ ← ψ − η_φ ∇_ψ Loss(θ, T).
7:   Improve the policy π_ξ (and embedding model) with SAC-style entropy regularization: ξ ← ξ + η_π ∇_ξ E_{s∼T, a∼π_ξ(·|s)}[Q_θ(s, a) − log π_ξ(a | s)], ψ ← ψ + η_φ ∇_ψ E_{s∼T, a∼π_ξ(·|s)}[Q_θ(s, a) − log π_ξ(a | s)].
8: end for
9: Output: learned ranking policy π_ξ*, Q-function Q_θ*, embedding model φ_ψ*(·, ·).
10: Recover the optimal ranking from the learned policy π_ξ* using Definition 2.
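To make lines 5-7 of the algorithm concrete, here is a minimal PyTorch sketch of one training iteration, under several assumptions: the embedding, Q-network, and policy modules (embed, q_net, policy) are hypothetical stand-ins, candidate documents are indexed 0..K-1 in logged-ranking order, and the conservative term is the generic CQL penalty rather than the paper's exact Loss(θ, T) from Equation (5).

```python
# A minimal, assumption-laden sketch of one iteration of Algorithm 1 (CQL variant).
import torch
import torch.nn.functional as F

def cuolr_cql_step(batch, embed, q_net, policy, q_opt, pi_opt, alpha=0.1, gamma=1.0):
    """One update on a batch of logged episodes {(s_k, a_k, r_k)}_{k=1..K}.

    batch["doc_feats"]: (B, K, d) features of the K documents in each logged ranking R_i
    batch["clicks"]:    (B, K) click rewards c_i(q_i, R_i)[k]
    """
    # Line 5: states s_k = phi_psi(R_i[:k], k); the prefix encoder is abstracted into `embed`.
    states = embed(batch["doc_feats"])                      # (B, K, h)
    q_all = q_net(states)                                   # (B, K, K): Q(s_k, a) over candidate docs
    logged_a = torch.arange(q_all.shape[-1], device=q_all.device)
    logged_a = logged_a.unsqueeze(0).expand(q_all.shape[0], -1)          # logged action at position k is doc k
    q_logged = q_all.gather(-1, logged_a.unsqueeze(-1)).squeeze(-1)      # Q(s_k, a_k)

    # One-step Bellman target: r_k + gamma * max_a Q(s_{k+1}, a); position K is terminal.
    with torch.no_grad():
        target = batch["clicks"].float().clone()
        target[:, :-1] += gamma * q_all[:, 1:].max(dim=-1).values

    bellman = F.mse_loss(q_logged, target)
    # Generic CQL regularizer: push Q down on all actions, up on the logged actions.
    conservative = (torch.logsumexp(q_all, dim=-1) - q_logged).mean()
    q_opt.zero_grad()
    (bellman + alpha * conservative).backward()
    q_opt.step()

    # Line 7: SAC-style improvement, maximizing E_{a~pi}[Q(s, a) - log pi(a | s)].
    states = embed(batch["doc_feats"])
    log_pi = policy(states)                                 # (B, K, K) log-probabilities (assumed)
    pi_obj = (log_pi.exp() * (q_net(states).detach() - log_pi)).sum(-1).mean()
    pi_opt.zero_grad()
    (-pi_obj).backward()
    pi_opt.step()
```

Since lines 6 and 7 both update ψ, the embedding parameters can be registered in both optimizers so that either loss adjusts them.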
Open Source Code | Yes | Codes: https://github.com/ZeyuZhang1901/Unified-Off-Policy-LTR-Neurips2023
Open Datasets | Yes | We conduct semi-synthetic experiments on two traditional learning-to-rank benchmark datasets: MSLR-WEB10K and Yahoo! LETOR (set 1).
Dataset Splits | Yes | Both datasets come with a train-val-test split. The train data is used for generating the logging policy and simulating clicks, the validation data is used for hyperparameter selection, and the final performance of the learned ranking policy is evaluated on the test data.
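As an illustration of how these splits could be consumed, the sketch below loads one MSLR-WEB10K fold in its standard SVMLight/LETOR format and marks where the logging policy, click simulation, hyperparameter selection, and final evaluation would plug in; the commented helper names (train_logging_policy, simulate_clicks) are hypothetical.

```python
# A minimal sketch of loading an MSLR-WEB10K fold and routing the three splits
# as described in the paper's setup; helper functions in comments are hypothetical.
from sklearn.datasets import load_svmlight_file

def load_fold(fold_dir):
    splits = {}
    for name in ("train", "vali", "test"):
        X, y, qid = load_svmlight_file(f"{fold_dir}/{name}.txt", query_id=True)
        splits[name] = (X, y, qid)
    return splits

splits = load_fold("MSLR-WEB10K/Fold1")
# Train split: fit a logging policy, rank its documents, and simulate clicks on them.
#   logging_policy = train_logging_policy(*splits["train"])
#   logged_data    = simulate_clicks(logging_policy, *splits["train"])
# Validation split: select hyperparameters (learning rate, alpha, ...) for the learner.
# Test split: report the final NDCG@K of the ranking policy recovered from pi_xi.
```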
Hardware Specification | No | The paper mentions 'Google Cloud Research Credits Program' in the acknowledgments but does not provide specific details on the hardware (e.g., GPU/CPU models, memory) used for the experiments.
Software Dependencies | No | The paper mentions the 'Adam optimizer' and the 'RankLib learning-to-rank library' but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | For the embedding model in our method, we use multi-head attention with 8 heads. For the actors and critics in the CQL and SAC algorithms, we use a 2-layer MLP with width 256 and ReLU activation. The conservative parameter α (marked red in Equation (5)) in CQL is set to 0.1. We use Adam for all methods, with the learning rate tuned on the validation set.
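A minimal PyTorch sketch of the reported components follows: an 8-head multi-head-attention embedding over the ranked prefix, and 2-layer MLPs (width 256, ReLU) for the actor and critic. The embedding dimension (128), the "two hidden layers" reading of "2-layer MLP", and the example learning rate are assumptions; only the head count, MLP width, activation, optimizer, and α = 0.1 come from the quoted setup.

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """phi_psi(R[:k], k): self-attention over the documents already placed (assumed design)."""
    def __init__(self, doc_dim, embed_dim=128, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(doc_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, prefix_feats):          # (B, k, doc_dim)
        x = self.proj(prefix_feats)
        out, _ = self.attn(x, x, x)           # self-attention over the prefix
        return out.mean(dim=1)                # (B, embed_dim) state representation

def two_layer_mlp(in_dim, out_dim, width=256):
    """Architecture shared by the CQL/SAC actor and critic heads (width 256, ReLU)."""
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, out_dim))

doc_dim, embed_dim, num_candidates = 136, 128, 10    # 136 = MSLR-WEB10K feature count
embedding = PrefixEmbedding(doc_dim, embed_dim)
critic = two_layer_mlp(embed_dim, num_candidates)    # Q_theta(s, .)
actor = two_layer_mlp(embed_dim, num_candidates)     # logits of pi_xi(. | s)
cql_alpha = 0.1                                      # conservative coefficient in Eq. (5)
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(critic.parameters()) + list(actor.parameters()),
    lr=3e-4)                                         # example value; tuned on the validation set
```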