Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

Authors: Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya | Pages 13675-13682

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper conducts a thorough comparison of out-of-domain intent detection methods. We evaluate multiple contextual encoders and methods proven to be efficient on three standard datasets for intent classification, expanded with out-of-domain utterances. Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results. Mahalanobis distance, together with utterance representations derived from Transformer-based encoders, outperforms other methods by a wide margin (1-5% in terms of AUROC) and establishes new state-of-the-art results for all datasets.
Researcher Affiliation | Collaboration | Alexander Podolskiy1, Dmitry Lipin1, Andrey Bout1, Ekaterina Artemova1,2, Irina Piontkovskaya1; 1 Huawei Noah's Ark Lab, Moscow, Russia; 2 HSE University, Moscow, Russia
Pseudocode | No | The paper describes the methods and calculations using equations and natural language, but it does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We perform our experiments with PyTorch, PyTorch Lightning and Hugging Face Transformers library (Wolf et al. 2019).' These are external libraries used, not the authors' own source code for their methodology. There is no explicit statement or link indicating the release of their own source code.
Open Datasets | Yes | CLINC150 is an intent classification dataset modeling a real-life situation; some utterances fall outside the domains covered by the train data (Larson et al. 2019)... ROSTD extends the English part of a multilingual dialog dataset with OOD utterances (Schuster et al. 2019; Gangal et al. 2020)... SNIPS has no explicit ID/OOD split; the total number of intents is 7. Following the setup of Lin and Xu (2019), all labels are randomly split into ID and OOD parts.
Dataset Splits | Yes | Table 1 (dataset statistics) reports, per dataset: CLINC150: 15K train IND, 3K val IND, 4.5K test IND; ROSTD: 30K train IND, 4K val IND, 8.6K test IND; SNIPS: 13K train IND, 0.7K val IND, 0.7K test IND. The table also includes val OOD and test OOD counts.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as CPU or GPU models.
Software Dependencies | Yes | 'We perform our experiments with PyTorch, PyTorch Lightning and Hugging Face Transformers library (Wolf et al. 2019).' A footnote specifies PyTorch version 1.4.0, PyTorch Lightning version 0.7.5, and Hugging Face Transformers version 2.8.0.
Experiment Setup | No | The paper states, 'We tune hyper-parameters to maximize performance on the validation set for each of the ID intent classification tasks.' However, it does not provide the specific values for these hyperparameters or other system-level training settings in the main text.
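The Mahalanobis-distance approach summarized in the abstract scores an utterance by its minimum squared Mahalanobis distance to any in-domain class centroid, using a covariance shared across classes. The authors did not release code, so the sketch below is my own minimal NumPy reconstruction of that standard formulation; function names and implementation details are assumptions, not the paper's implementation.

```python
import numpy as np

def fit_mahalanobis(features, labels):
    """Fit per-class means and a shared precision matrix from ID features.

    features: (n, d) array of encoder representations (e.g. [CLS] vectors).
    labels:   (n,) array of in-domain intent labels.
    """
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    # Shared (tied) covariance: pool class-centered features.
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(features)
    precision = np.linalg.pinv(cov)  # pseudo-inverse for numerical safety
    return means, precision

def ood_score(x, means, precision):
    """OOD score: min squared Mahalanobis distance to any class centroid.

    Higher score -> more likely out-of-domain.
    """
    return min((x - mu) @ precision @ (x - mu) for mu in means.values())
```

At evaluation time, scores over a mixed ID/OOD test set would be fed to a threshold-free metric such as AUROC, matching the 1-5% AUROC margins reported in the abstract.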