Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity

Authors: Charlie Hou, Kiran Koshy Thekumparampil, Michael Shavlovsky, Giulia Fanti, Yesh Dattatreya, Sujay Sanghavi

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics, sometimes by as much as 38%, both overall and on outliers.
Researcher Affiliation Collaboration Charlie Hou (EMAIL), Department of Electrical and Computer Engineering, Carnegie Mellon University; Kiran K. Thekumparampil (EMAIL), Amazon, Palo Alto; Sujay Sanghavi (EMAIL), Department of Computer Science, University of Texas at Austin
Pseudocode No The paper describes methods like SimCLR and SimCLR-Rank with mathematical formulas and textual descriptions, such as the InfoNCE loss function, but does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes Code is provided at https://github.com/houcharlie/ltr-pretrain/.
Open Datasets Yes Our experiments are run on the three standard public datasets in the ranking literature: MSLR-WEB30K (Qin & Liu, 2013), Yahoo (Chapelle & Chang, 2011), and Istella-S (Lucchese et al., 2016), as well as an industry-scale proprietary ranking dataset.
Dataset Splits Yes For each of the three public datasets, we vary the fraction of labeled query groups in the training set in {0.001, 0.002, 0.005, 0.1, 0.5, 1.0}. Note that within each labeled query group all items are labeled. ... To simulate this scenario, we follow the methodology from Yang et al. (2022), to generate independent stochastic binary labels for each item from its true relevance label for training and validation sets of each of the public datasets. Note that we still use the true (latent) relevance labels in the test set for evaluation.
Hardware Specification Yes We report all results as averages over 3 trials, and use single V100 GPUs from a shared cluster to run the experiments.
Software Dependencies No The paper mentions software like 'lightgbm' and optimizers like 'Adam' and the 'AdamW optimizer' but does not provide specific version numbers for any of these components.
Experiment Setup Yes Pretraining: (1) ... with learning rate 0.0005 using Adam (Kingma & Ba, 2014) ... (8) we pretrain for 300 epochs for all methods, and (9) we use a batch size of roughly 200000 items... Finetuning: (1) finetuning is done on the labeled train set by adding a three-layer MLP to the top of the pretrained model and training only this head for 100 epochs and then fully finetuning for 100 epochs using Adam with a learning rate of 5e-5, (2) we use an average batch size of roughly 1000 items (may vary based on query group size), (3) we use the LambdaRank loss (Burges, 2010).
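The InfoNCE loss mentioned under the Pseudocode variable is the standard contrastive objective used by SimCLR-style pretraining. As a point of reference only (the paper's exact variant may differ), a minimal NumPy sketch of one-directional InfoNCE over two views of a batch looks like this; `temperature` and the single-direction formulation are assumptions here:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """One-directional InfoNCE loss over two batches of embeddings.

    z1, z2: (N, d) arrays of embeddings for two views of the same N
    items; row i of z1 and row i of z2 form a positive pair, and all
    other rows of z2 act as negatives for row i of z1.
    """
    # L2-normalize each row so the dot product is a cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    # Cross-entropy with the diagonal (matched pairs) as the target class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When the two views are identical, the diagonal dominates and the loss is small; with unrelated views the loss approaches log N, which is a quick sanity check for an implementation.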
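The Dataset Splits row describes generating independent stochastic binary labels from graded relevance (following Yang et al., 2022). A common mapping in the LTR literature sets the positive-label probability to p = (2^rel - 1) / (2^rel_max - 1); this particular mapping is an assumption here, not confirmed from the paper. A minimal sketch:

```python
import numpy as np

def binarize_labels(relevance, max_rel=4, rng=None):
    """Draw independent stochastic binary labels from graded relevance.

    Uses p = (2**rel - 1) / (2**max_rel - 1), a common choice in the
    LTR literature (assumed mapping; Yang et al. (2022) may differ).
    Relevance 0 always yields label 0; relevance max_rel yields 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    rel = np.asarray(relevance)
    p = (2.0 ** rel - 1.0) / (2.0 ** max_rel - 1.0)
    return (rng.random(p.shape) < p).astype(np.int64)
```

Per the quoted setup, this binarization would apply only to the train and validation splits, while the test split keeps the true graded relevance labels for evaluation.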
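The two-phase finetuning quoted under Experiment Setup (train a fresh MLP head on a frozen backbone, then unfreeze everything) can be sketched as follows. This is a hypothetical PyTorch skeleton, not the authors' code: the `train_step` callback, the hidden widths of the MLP head, and the plain `nn.Sequential` composition are all assumptions; only the three-layer head, the Adam optimizer, the 5e-5 learning rate, and the 100+100 epoch schedule come from the quoted setup (the ranking loss itself would live inside `train_step`):

```python
import torch
from torch import nn

def finetune(backbone, feat_dim, train_step, head_epochs=100, full_epochs=100):
    """Two-phase finetuning sketch.

    Phase 1: freeze the pretrained backbone and train a fresh
    three-layer MLP head. Phase 2: unfreeze the backbone and train
    the full model. `train_step(model, opt)` is assumed to run one
    epoch of updates with the ranking loss.
    """
    head = nn.Sequential(
        nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        nn.Linear(feat_dim, 1),          # single relevance score per item
    )
    model = nn.Sequential(backbone, head)

    # Phase 1: train only the new head on the frozen backbone
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(head.parameters(), lr=5e-5)
    for _ in range(head_epochs):
        train_step(model, opt)

    # Phase 2: unfreeze and finetune the full model
    for p in backbone.parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(model.parameters(), lr=5e-5)
    for _ in range(full_epochs):
        train_step(model, opt)
    return model
```

Training the head first before full finetuning is a standard trick to keep randomly initialized head gradients from distorting the pretrained representation early on.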