Toward Understanding Privileged Features Distillation in Learning-to-Rank

Authors: Shuo Yang, Sujay Sanghavi, Holakou Rahmanian, Jan Bakus, Vishwanathan S. V. N.

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon's logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models.
Researcher Affiliation | Collaboration | Shuo Yang (UT Austin, yangshuo ut@utexas.edu); Sujay Sanghavi (Amazon, sujayrs@amazon.com); Holakou Rahmanian (Amazon, holakou@amazon.com); Jan Bakus (Amazon, jbakus@amazon.com); S.V.N. Vishwanathan (Amazon, vishy@amazon.com)
Pseudocode | No | The paper describes the steps for Privileged Features Distillation in Section 3.1 (Step I and Step II), but only in paragraph form; the procedure is not presented as a formally structured pseudocode block or algorithm. (A minimal code sketch of the two steps is given after this table.)
Open Source Code | Yes | The authors' checklist states: 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplemental material.'
Open Datasets | Yes | We first evaluate the performance of PFD on three widely used public ranking datasets. Specifically, we use Set1 from the Yahoo! Learning to Rank Challenge [CC11]; the Istella Learning to Rank dataset [DLN+16]; and the Microsoft Learning to Rank MSLR-Web30k dataset [QL13].
Dataset Splits | Yes | The validation set is carved out of the training set: 10% of the training data is held out for Yahoo, Istella, and Web30k. (A split sketch follows the table.)
Hardware Specification | No | The paper does not specify any particular hardware used for running its experiments, such as specific GPU models, CPU types, or cloud computing instances. The authors' checklist explicitly states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'
Software Dependencies | No | The paper mentions using PyTorch for the implementation and the Adam optimizer, and refers to specific loss functions (Rank BCE and RankNet). However, it does not provide version numbers for PyTorch or any other software libraries used in the experiments.
Experiment Setup | Yes | The ranking model is a 5-layer fully connected neural network with hidden dimensions [256, 128, 64, 32, 16]. The Adam optimizer [KB14] is used with learning rate 1e-4 and batch size 256. Training lasts for 100 epochs. The checkpoint with the best NDCG@8 on the validation set is used for evaluation. (A PyTorch sketch of this setup follows the table.)
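
For readers who want the two distillation steps in algorithmic form, the following is a minimal sketch of the procedure the paper describes only in prose in Section 3.1. It uses the paper's stated stack (PyTorch, Adam, lr 1e-4), but the squared loss here stands in for the paper's ranking losses (Rank BCE / RankNet), and the soft/hard label weight `lam` is a common distillation choice rather than a detail confirmed by the paper; all names are illustrative.

```python
import torch
import torch.nn as nn

def pfd_train(teacher, student, x_regular, x_priv, y,
              epochs=100, lr=1e-4, lam=0.5):
    """Two-step PFD, following the prose of Section 3.1.

    Step I : train the teacher on regular + privileged features.
    Step II: train the student on regular features only, fitting a mix of
             the teacher's scores (soft labels) and the true labels.
    """
    loss_fn = nn.MSELoss()  # stand-in for the paper's ranking losses

    # Step I: the teacher sees both feature groups.
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(teacher(torch.cat([x_regular, x_priv], dim=1)), y).backward()
        opt.step()

    # Step II: freeze the teacher's scores, then distill into the student,
    # which only ever sees the regular features.
    with torch.no_grad():
        soft_y = teacher(torch.cat([x_regular, x_priv], dim=1))
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        pred = student(x_regular)
        (lam * loss_fn(pred, soft_y) + (1 - lam) * loss_fn(pred, y)).backward()
        opt.step()
    return student
```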
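The 10% validation split from the Dataset Splits row can be reproduced in a few lines. Holding out whole queries, so that each query's documents stay together in one split, is an assumption here; the report only states the 10% figure.

```python
import numpy as np

def split_by_query(query_ids, val_frac=0.1, seed=0):
    """Boolean train/val masks over rows, holding out val_frac of the
    queries (query-level grouping is an assumption; the paper only
    states that 10% of the training data is held out)."""
    rng = np.random.default_rng(seed)
    unique_qids = np.unique(query_ids)
    rng.shuffle(unique_qids)
    val_qids = set(unique_qids[: int(val_frac * len(unique_qids))].tolist())
    val_mask = np.fromiter((q in val_qids for q in query_ids), dtype=bool)
    return ~val_mask, val_mask

# Usage: train_mask, val_mask = split_by_query(train_query_ids)
```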
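Finally, the Experiment Setup row pins down enough detail to sketch the model and optimizer in PyTorch. The ReLU activations, the scalar scoring head, and the input width are assumptions not stated in the quoted passage; the layer widths, optimizer, and learning rate follow the row above.

```python
import torch
import torch.nn as nn

def make_ranker(in_dim):
    """Fully connected scorer with hidden dims [256, 128, 64, 32, 16].
    ReLU activations and the final scalar head are assumptions."""
    dims = [in_dim, 256, 128, 64, 32, 16]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    layers.append(nn.Linear(dims[-1], 1))  # one relevance score per document
    return nn.Sequential(*layers)

model = make_ranker(in_dim=136)  # e.g., MSLR-Web30k has 136 features
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Train for 100 epochs with batch size 256, keeping the checkpoint with
# the best validation NDCG@8, as the setup row specifies.
```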