Language Modelling via Learning to Rank
Authors: Arvid Frydenlund, Gagandeep Singh, Frank Rudzicz
AAAI 2022, pp. 10636-10644
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, the N-grams act as competitive teachers and achieve similar performance as using either BERT or a Born-Again model as teachers. GPT-2 always acts as the best teacher, though, and using it and a Transformer-XL student on Wiki-02, rank-based KD reduces a cross-entropy baseline from 65.27 to 55.94, versus 56.70 for KL-based KD. |
| Researcher Affiliation | Collaboration | Arvid Frydenlund (1, 2), Gagandeep Singh (3), Frank Rudzicz (1, 2, 4). Affiliations: 1 Department of Computer Science, University of Toronto; 2 Vector Institute for Artificial Intelligence; 3 Nuance Communications Inc.; 4 Unity Health Toronto. Emails: arvie@cs.toronto.edu, gagandeep.singh1@nuance.com, frank@cs.toronto.edu |
| Pseudocode | No | The paper describes methods such as "N-gram Branching Set Construction" and the "Plackett-Luce Rank Loss" in detailed prose, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps (a hedged code sketch of a Plackett-Luce rank loss is given after this table). |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to a code repository. |
| Open Datasets | Yes | We use the word-level Penn Treebank (PTB) and WikiText-2 (Wiki02) datasets and use AWD-LSTM (LSTM) and Transformer-XL (T-XL) students (Taylor, Marcus, and Santorini 2003; Merity et al. 2017; Merity, Keskar, and Socher 2018; Dai et al. 2019). |
| Dataset Splits | No | The paper mentions using "validation sets" for PTB and Wiki02 in Table 1 and Table 2, but it does not specify the exact split percentages or sample counts for training, validation, or test sets. It implies the use of standard splits but does not explicitly detail them. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions various models (e.g., GPT-2, BERT, Transformer-XL, AWD-LSTM) and optimizers (Adam), but it does not specify the version numbers for any software dependencies, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | See Appendix A for further experimental details. All students were trained using the AWD-LSTM training recipe (Merity et al. 2017) with the following modifications: a single learning rate (LR) of 3, a batch size of 80, and Adam (Kingma and Ba 2015) with an initial momentum of 0.7 that linearly warms up to 0.9. We also used weight decay (WD) of 0.1 for the LSTM and 0.01 for the T-XL. We trained for 40 epochs on PTB and 20 epochs on Wiki02. We cycled the interpolation weight (Clark et al. 2019) from 0.4 to 0.7. For the PL-s loss, we set the interpolation weight statically to 0.5. (A hedged configuration sketch based on these reported values follows this table.) |
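
The Plackett-Luce rank loss is only described in prose in the paper, so the sketch below is not the authors' implementation. It is a minimal, hedged PyTorch example of a standard listwise Plackett-Luce loss, assuming the teacher supplies an ordered top-k candidate set per position; the function name, tensor shapes, and the restriction of the normalisation to the candidate set are illustrative assumptions.

```python
import torch

def plackett_luce_loss(student_logits: torch.Tensor,
                       teacher_ranking: torch.Tensor) -> torch.Tensor:
    """Listwise Plackett-Luce loss (illustrative sketch, not the paper's code).

    student_logits: (batch, vocab) unnormalised next-token scores from the student LM.
    teacher_ranking: (batch, k) vocabulary indices of the top-k candidate tokens,
        ordered from most to least preferred by the teacher.
    """
    # Student scores of the ranked candidates, in teacher-preferred order.
    ranked_scores = torch.gather(student_logits, 1, teacher_ranking)  # (batch, k)

    # Plackett-Luce: P(ranking) = prod_i exp(s_i) / sum_{j >= i} exp(s_j).
    # logcumsumexp over the reversed scores yields log sum_{j >= i} exp(s_j).
    reversed_scores = torch.flip(ranked_scores, dims=[1])
    log_denoms = torch.flip(torch.logcumsumexp(reversed_scores, dim=1), dims=[1])

    log_likelihood = (ranked_scores - log_denoms).sum(dim=1)  # (batch,)
    return -log_likelihood.mean()
```

For example, with k = 5, `teacher_ranking` would hold the teacher's five most-preferred next-token indices at each position; in the paper's framing these candidates come from the constructed branching sets, and the rank loss is interpolated with a standard cross-entropy term via the weight described in the experiment setup.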
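
The hyperparameters in the experiment-setup row map onto an optimiser configuration roughly as follows. This is a hedged sketch assuming PyTorch; the manual beta1 warm-up and the triangular cycle for the interpolation weight are assumed interpretations of "initial momentum of 0.7 which linearly warms up to 0.9" and "cycled ... from 0.4 to 0.7", not code from the paper.

```python
import torch

# Values as reported in the quoted setup text (learning rate of 3 is as stated).
LR = 3.0
BATCH_SIZE = 80
WEIGHT_DECAY = 0.1       # 0.1 for the LSTM student, 0.01 for the Transformer-XL student
EPOCHS_PTB, EPOCHS_WIKI02 = 40, 20

# Placeholder student; the paper uses AWD-LSTM and Transformer-XL students.
model = torch.nn.LSTM(input_size=400, hidden_size=1150)

optimizer = torch.optim.Adam(model.parameters(), lr=LR,
                             betas=(0.7, 0.999), weight_decay=WEIGHT_DECAY)

def warm_up_momentum(step: int, warmup_steps: int) -> None:
    """Linearly warm Adam's beta1 ("momentum") from 0.7 to 0.9 (assumed schedule)."""
    frac = min(step / warmup_steps, 1.0)
    beta1 = 0.7 + frac * (0.9 - 0.7)
    for group in optimizer.param_groups:
        group["betas"] = (beta1, group["betas"][1])

def interpolation_weight(step: int, cycle_steps: int) -> float:
    """Cycle the CE/KD interpolation weight between 0.4 and 0.7 (triangular cycle assumed)."""
    phase = (step % cycle_steps) / cycle_steps   # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * phase - 1.0)           # 0 -> 1 -> 0 over one cycle
    return 0.4 + tri * (0.7 - 0.4)
```

The per-step helpers would be called inside the training loop before each optimiser step; the exact warm-up and cycle lengths are not specified in the quoted text, so they are left as parameters here.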