Language Modelling via Learning to Rank
Authors: Arvid Frydenlund, Gagandeep Singh, Frank Rudzicz
AAAI 2022, pp. 10636-10644
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, the N-grams act as competitive teachers and achieve similar performance as using either BERT or a Born-Again model as teachers. GPT-2 always acts as the best teacher, though, and using it and a Transformer-XL student on Wiki-02, rank-based KD reduces a cross-entropy baseline from 65.27 to 55.94, versus 56.70 for KL-based KD. |
| Researcher Affiliation | Collaboration | Arvid Frydenlund (1, 2), Gagandeep Singh (3), Frank Rudzicz (1, 2, 4). Affiliations: 1 Department of Computer Science, University of Toronto; 2 Vector Institute for Artificial Intelligence; 3 Nuance Communications Inc.; 4 Unity Health Toronto. Emails: arvie@cs.toronto.edu, gagandeep.singh1@nuance.com, frank@cs.toronto.edu |
| Pseudocode | No | The paper describes methods such as "N-gram Branching Set Construction" and the "Plackett-Luce Rank Loss" in detailed prose, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps (a hedged code sketch of a Plackett-Luce rank loss is given after this table). |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to a code repository. |
| Open Datasets | Yes | We use the word-level Penn Treebank (PTB) and WikiText-2 (Wiki02) datasets and use AWD-LSTM (LSTM) and Transformer-XL (T-XL) students (Taylor, Marcus, and Santorini 2003; Merity et al. 2017; Merity, Keskar, and Socher 2018; Dai et al. 2019). |
| Dataset Splits | No | The paper mentions using "validation sets" for PTB and Wiki02 in Table 1 and Table 2, but it does not specify the exact split percentages or sample counts for training, validation, or test sets. It implies the use of standard splits but does not explicitly detail them. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions various models (e.g., GPT-2, BERT, Transformer-XL, AWD-LSTM) and optimizers (Adam), but it does not specify the version numbers for any software dependencies, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | See Appendix A for further experimental details. All students were trained using the AWD-LSTM training recipe (Merity et al. 2017) with the following modifications: a single learning rate (LR) of 3, a batch size of 80, and Adam (Kingma and Ba 2015) with an initial momentum of 0.7 that linearly warms up to 0.9. We also used weight decay (WD) of 0.1 for the LSTM and 0.01 for the T-XL. We trained for 40 epochs on PTB and 20 epochs on Wiki02. We cycled the interpolation weight (Clark et al. 2019) from 0.4 to 0.7. For the PL-s loss, we set the interpolation weight statically to 0.5. (A hedged configuration sketch based on these reported values follows this table.) |
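
The Plackett-Luce rank loss is only described in prose in the paper, so the sketch below is not the authors' implementation. It is a minimal, hedged PyTorch example of a standard listwise Plackett-Luce loss, assuming the teacher supplies an ordered top-k candidate set per position; the function name, tensor shapes, and the restriction of the normalisation to the candidate set are illustrative assumptions.

```python
import torch

def plackett_luce_loss(student_logits: torch.Tensor,
                       teacher_ranking: torch.Tensor) -> torch.Tensor:
    """Listwise Plackett-Luce loss (illustrative sketch, not the paper's code).

    student_logits: (batch, vocab) unnormalised next-token scores from the student LM.
    teacher_ranking: (batch, k) vocabulary indices of the top-k candidate tokens,
        ordered from most to least preferred by the teacher.
    """
    # Student scores of the ranked candidates, in teacher-preferred order.
    ranked_scores = torch.gather(student_logits, 1, teacher_ranking)  # (batch, k)

    # Plackett-Luce: P(ranking) = prod_i exp(s_i) / sum_{j >= i} exp(s_j).
    # logcumsumexp over the reversed scores yields log sum_{j >= i} exp(s_j).
    reversed_scores = torch.flip(ranked_scores, dims=[1])
    log_denoms = torch.flip(torch.logcumsumexp(reversed_scores, dim=1), dims=[1])

    log_likelihood = (ranked_scores - log_denoms).sum(dim=1)  # (batch,)
    return -log_likelihood.mean()
```

For example, with k = 5, `teacher_ranking` would hold the teacher's five most-preferred next-token indices at each position; in the paper's framing these candidates come from the constructed branching sets, and the rank loss is interpolated with a standard cross-entropy term via the weight described in the experiment setup.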
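
The hyperparameters in the experiment-setup row map onto an optimiser configuration roughly as follows. This is a hedged sketch assuming PyTorch; the manual beta1 warm-up and the triangular cycle for the interpolation weight are assumed interpretations of "initial momentum of 0.7 which linearly warms up to 0.9" and "cycled ... from 0.4 to 0.7", not code from the paper.

```python
import torch

# Values as reported in the quoted setup text (learning rate of 3 is as stated).
LR = 3.0
BATCH_SIZE = 80
WEIGHT_DECAY = 0.1       # 0.1 for the LSTM student, 0.01 for the Transformer-XL student
EPOCHS_PTB, EPOCHS_WIKI02 = 40, 20

# Placeholder student; the paper uses AWD-LSTM and Transformer-XL students.
model = torch.nn.LSTM(input_size=400, hidden_size=1150)

optimizer = torch.optim.Adam(model.parameters(), lr=LR,
                             betas=(0.7, 0.999), weight_decay=WEIGHT_DECAY)

def warm_up_momentum(step: int, warmup_steps: int) -> None:
    """Linearly warm Adam's beta1 ("momentum") from 0.7 to 0.9 (assumed schedule)."""
    frac = min(step / warmup_steps, 1.0)
    beta1 = 0.7 + frac * (0.9 - 0.7)
    for group in optimizer.param_groups:
        group["betas"] = (beta1, group["betas"][1])

def interpolation_weight(step: int, cycle_steps: int) -> float:
    """Cycle the CE/KD interpolation weight between 0.4 and 0.7 (triangular cycle assumed)."""
    phase = (step % cycle_steps) / cycle_steps   # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * phase - 1.0)           # 0 -> 1 -> 0 over one cycle
    return 0.4 + tri * (0.7 - 0.4)
```

The per-step helpers would be called inside the training loop before each optimiser step; the exact warm-up and cycle lengths are not specified in the quoted text, so they are left as parameters here.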