Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees?

Authors: Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Michael Bendersky, Marc Najork

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on three widely used public LTR datasets. Our neural models are trained with listwise ranking losses. On all datasets, our framework can outperform recent neural LTR methods by a large margin. When comparing with the strong LambdaMART implementation, λMARTGBM, we are able to achieve equally good results, if not better. We compare a comprehensive list of methods in Table 2. Ablation study. We provide some ablation study results in Table 4 to highlight the effectiveness of each component in our framework.
Researcher Affiliation | Industry | Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Michael Bendersky, Marc Najork (Google Research), {zhenqin,lyyanle,hlz,yitay,ramakumar,xuanhui,bemike,najork}@google.com
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such.
Open Source Code | No | We are in the process to release the code and trained models in an open-sourced software package.
Open Datasets | Yes | The three data sets we used in our experiments are public benchmark datasets widely adopted by the research community. They are the LETOR dataset from Microsoft (Qin & Liu, 2013), Set1 from the YAHOO LTR challenge (Chapelle & Chang, 2011), and Istella (Dato et al., 2016).
Dataset Splits | Yes | Table 1: The statistics of the three largest public benchmark datasets for LTR models (a loading sketch that verifies these counts follows the table).

            #queries                           #docs
          training   validation   test         training    validation   test
Web30K    18,919     6,306        6,306        2,270,296   747,218      753,611
Yahoo     19,944     2,994        6,983        473,134     71,083       165,660
Istella   20,901     2,318        9,799        6,587,822   737,803      3,129,004
Hardware Specification | No | No specific hardware details are mentioned in the paper.
Software Dependencies | No | For all our experiments using neural network approaches, we implemented them using the TF-Ranking (Pasumarthi et al., 2019) library. (No version specified for TF-Ranking or any other library.) A dependency usage sketch follows the table.
Experiment Setup | Yes | For λMARTGBM, we do a grid search for number of trees {300, 500, 1000}, number of leaves {200, 500, 1000}, and learning rate {0.01, 0.05, 0.1, 0.5}. For our neural models the main hyperparameters are hidden layer size {256, 512, 1024, 2048, 3072, 4096} and number of layers {3, 4, 5, 6} for regular DNN, data augmentation noise [0, 5.0] using binary search with step 0.1, number of attention layers {3, 4, 5, 6}, and number of attention heads {2, 3, 4, 5}. We apply a simple log1p transformation to every element of x and empirically find it works well for the Web30K and Istella datasets. We report all results based on the softmax cross entropy loss l(y, s(x)) = -∑_{i=1}^{n} y_i log(e^{s_i} / ∑_j e^{s_j}) since it is simple and empirically robust in general, as demonstrated in Appendix B.2. (A sketch of the transform and loss follows the table.)
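
For reference, the per-split counts in the Dataset Splits row can be checked against the raw data. Below is a minimal sketch assuming the splits are distributed in the standard LIBSVM qid format used by these benchmarks; the file path is a placeholder, not from the paper.

```python
# Minimal sketch: count queries and documents in one LTR split file.
# Assumes the standard LIBSVM "qid" format used by Web30K, Yahoo, and Istella releases.
import numpy as np
from sklearn.datasets import load_svmlight_file

def split_stats(path):
    """Return (#queries, #docs) for one split file."""
    features, labels, qids = load_svmlight_file(path, query_id=True)
    return len(np.unique(qids)), features.shape[0]

# Placeholder path; on the Web30K training split this should match the
# 18,919 queries / 2,270,296 docs reported in the table above.
print(split_stats("train.txt"))
```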
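The Software Dependencies row notes that TF-Ranking is used without a pinned version. As a hedged illustration only, the snippet below exercises a listwise softmax loss and an NDCG metric through what recent TF-Ranking releases expose under tfr.keras; the class names and tensor shapes are assumptions about the library, not code from the paper.

```python
import tensorflow as tf
import tensorflow_ranking as tfr  # version unspecified in the paper

# Two queries, each with a list of three documents: graded labels and model scores.
y_true = tf.constant([[2.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])
y_pred = tf.constant([[1.3, -0.2, 0.8],
                      [0.5,  0.4, 0.1]])

softmax_loss = tfr.keras.losses.SoftmaxLoss()      # listwise softmax cross entropy
ndcg_at_5 = tfr.keras.metrics.NDCGMetric(topn=5)   # NDCG@5, a standard LTR metric

print(float(softmax_loss(y_true, y_pred)))
print(float(ndcg_at_5(y_true, y_pred)))
```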
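The Experiment Setup row quotes both the log1p input transformation and the softmax cross-entropy loss. The sketch below restates both in plain NumPy so the formula is easy to check; the signed handling of negative features and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log1p_transform(x):
    """Element-wise log1p feature transform.

    Uses log(1 + |x|) with the sign preserved so that negative feature values
    stay well-defined (the exact variant is not spelled out in the quoted
    setup, so this signed form is an assumption).
    """
    return np.sign(x) * np.log1p(np.abs(x))

def softmax_cross_entropy_loss(y, s):
    """Listwise softmax cross-entropy for one query.

    y: relevance labels for the n documents in the list, shape (n,)
    s: predicted scores for the same documents, shape (n,)
    Computes l(y, s) = -sum_i y_i * log(exp(s_i) / sum_j exp(s_j)).
    """
    s = s - np.max(s)                            # shift for numerical stability
    log_softmax = s - np.log(np.sum(np.exp(s)))
    return -np.sum(y * log_softmax)

# Toy query: 3 documents with 3 raw features each, scored by a stand-in linear model.
x = np.array([[10.0, 0.5, -3.0],
              [ 2.0, 0.1,  0.0],
              [ 0.0, 7.5,  1.2]])
w = np.array([0.4, -0.1, 0.2])                   # illustrative weights only
scores = log1p_transform(x) @ w
labels = np.array([2.0, 0.0, 1.0])               # graded relevance labels
print(softmax_cross_entropy_loss(labels, scores))
```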