Toward Understanding Privileged Features Distillation in Learning-to-Rank
Authors: Shuo Yang, Sujay Sanghavi, Holakou Rahmanian, Jan Bakus, Vishwanathan S. V. N.
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon's logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models. |
| Researcher Affiliation | Collaboration | Shuo Yang, UT Austin (yangshuo@utexas.edu); Sujay Sanghavi, Amazon (sujayrs@amazon.com); Holakou Rahmanian, Amazon (holakou@amazon.com); Jan Bakus, Amazon (jbakus@amazon.com); S. V. N. Vishwanathan, Amazon (vishy@amazon.com) |
| Pseudocode | No | The paper describes the steps of Privileged Features Distillation in Section 3.1 (Step I and Step II), but only in paragraph form; the procedure is not presented as a formally structured pseudocode block or algorithm. (A hedged sketch of the two-step procedure is given after this table.) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplemental material. |
| Open Datasets | Yes | We first evaluate the performance of PFD on three widely used public ranking datasets. Specifically, we use the Set1 from Yahoo! Learn to rank challenge [CC11]; Istella Learning to Rank dataset [DLN+16]; and Microsoft Learning to Rank MSLR-Web30k dataset [QL13]. |
| Dataset Splits | Yes | The validation set is carved out of the training set: 10% of the training data is held out for Yahoo, Istella, and Web30k (see the split sketch after this table). |
| Hardware Specification | No | The paper does not specify any particular hardware used for running its experiments, such as specific GPU models, CPU types, or cloud computing instances. The authors' checklist explicitly states 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'. |
| Software Dependencies | No | The paper mentions using PyTorch for implementation and the Adam optimizer, and refers to specific loss functions such as rank BCE and RankNet. However, it does not provide version numbers for PyTorch or for any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | The ranking model is a 5-layer fully connected neural network with hidden dimensions [256, 128, 64, 32, 16]. Adam optimizer [KB14] is used with learning rate 1e-4 and batch size 256. The training lasts for 100 epochs. The best checkpoint (measured by NDCG@8 on the validation set) is used for evaluation. (A configuration sketch follows this table.) |
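
The two-step PFD procedure of Section 3.1 is described only in prose, so here is a minimal PyTorch sketch of it. It assumes a pointwise binary-relevance setup with full-batch updates; the names `mlp`, `pfd`, `x`, `p`, and `y` are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden=(256, 128, 64, 32, 16)):
    """Fully connected scorer: five hidden layers plus a scalar scoring head,
    matching the hidden dimensions reported in the paper."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def pfd(x, p, y, epochs=100, lr=1e-4):
    """x: regular features (n, d_x); p: privileged features (n, d_p);
    y: float relevance labels (n,). Mini-batching is omitted for brevity."""
    bce = nn.BCEWithLogitsLoss()

    # Step I: train a teacher on regular + privileged features.
    teacher = mlp(x.shape[1] + p.shape[1])
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        bce(teacher(torch.cat([x, p], dim=1)).squeeze(1), y).backward()
        opt.step()

    # Step II: train a student on regular features only, fitting the
    # teacher's soft predictions instead of the hard labels.
    with torch.no_grad():
        soft = torch.sigmoid(teacher(torch.cat([x, p], dim=1))).squeeze(1)
    student = mlp(x.shape[1])
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        bce(student(x).squeeze(1), soft).backward()  # soft targets are valid for BCE
        opt.step()
    return student
```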
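For the dataset split, the paper states only that 10% of the training data is held out for validation; whether the sampling is per query group or per example, and the random seed, are not reported. A minimal sketch under the assumption of a uniform random split:

```python
import torch

def split_train_val(num_examples, val_frac=0.10, seed=0):
    """Hold out val_frac of the training indices for validation
    (the 10% split the report describes); seed is an assumption."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_examples, generator=g)
    n_val = int(val_frac * num_examples)
    return perm[n_val:], perm[:n_val]  # (train indices, validation indices)
```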
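Finally, a sketch of the reported training configuration (Adam, learning rate 1e-4, 100 epochs, best checkpoint by validation NDCG@8), assuming the `mlp` scorer above and a binary cross-entropy loss (one of the losses the paper mentions). `train_loader` (yielding batches of 256) and `val_queries` (per-query feature/label pairs) are hypothetical inputs, and the NDCG implementation is a standard exponential-gain variant, not necessarily the authors' exact metric code.

```python
import copy
import torch

def ndcg_at_k(scores, labels, k=8):
    """NDCG@k for one query; scores and labels are 1-D float tensors."""
    order = torch.argsort(scores, descending=True)[:k]
    discounts = torch.log2(torch.arange(2, order.numel() + 2, dtype=torch.float))
    dcg = ((2.0 ** labels[order] - 1) / discounts).sum()
    ideal = torch.argsort(labels, descending=True)[:k]
    idcg = ((2.0 ** labels[ideal] - 1) / discounts[: ideal.numel()]).sum()
    return (dcg / idcg).item() if idcg > 0 else 0.0

def train_with_checkpointing(model, train_loader, val_queries, epochs=100, lr=1e-4):
    """Trains with Adam (lr 1e-4) for 100 epochs and keeps the checkpoint
    with the best validation NDCG@8, per the paper's setup."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    best_ndcg, best_state = -1.0, None
    for _ in range(epochs):
        for x, y in train_loader:  # batches of 256 (feature, label) pairs
            opt.zero_grad()
            bce(model(x).squeeze(1), y).backward()
            opt.step()
        with torch.no_grad():
            val_ndcg = sum(ndcg_at_k(model(xq).squeeze(1), yq)
                           for xq, yq in val_queries) / len(val_queries)
        if val_ndcg > best_ndcg:
            best_ndcg, best_state = val_ndcg, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```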