Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

Authors: Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya | Pages 13675-13682

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper conducts a thorough comparison of out-of-domain intent detection methods. We evaluate multiple contextual encoders and methods proven to be efficient on three standard datasets for intent classification, expanded with out-of-domain utterances. Our main findings show that fine-tuning Transformer-based encoders on in-domain data leads to superior results. Mahalanobis distance, together with utterance representations derived from Transformer-based encoders, outperforms other methods by a wide margin (1-5% in terms of AUROC) and establishes new state-of-the-art results for all datasets.
Researcher Affiliation | Collaboration | Alexander Podolskiy1, Dmitry Lipin1, Andrey Bout1, Ekaterina Artemova1,2, Irina Piontkovskaya1; 1 Huawei Noah's Ark Lab, Moscow, Russia; 2 HSE University, Moscow, Russia
Pseudocode | No | The paper describes the methods and calculations using equations and natural language, but it does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We perform our experiments with PyTorch, PyTorch Lightning and Hugging Face Transformers library (Wolf et al. 2019).' These are external libraries used, not the authors' own source code for their methodology. There is no explicit statement or link indicating the release of their own source code.
Open Datasets | Yes | CLINC150 is an intent classification dataset modeling a real-life situation; some utterances fall outside the domains covered by the train data (Larson et al. 2019)... ROSTD extends the English part of a multilingual dialog dataset with OOD utterances (Schuster et al. 2019; Gangal et al. 2020)... SNIPS has no explicit ID/OOD split; the total number of intents is 7. Following the setup of Lin and Xu (2019), all labels are randomly split into ID and OOD parts.
Dataset Splits | Yes | Table 1 (dataset statistics) reports, per dataset: CLINC150: 15K train IND, 3K val IND, 4.5K test IND; ROSTD: 30K train IND, 4K val IND, 8.6K test IND; SNIPS: 13K train IND, 0.7K val IND, 0.7K test IND. The table also includes val OOD and test OOD counts.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as CPU or GPU models.
Software Dependencies | Yes | 'We perform our experiments with PyTorch, PyTorch Lightning and Hugging Face Transformers library (Wolf et al. 2019).' A footnote specifies PyTorch version 1.4.0, PyTorch Lightning version 0.7.5, and Hugging Face Transformers version 2.8.0.
Experiment Setup | No | The paper states, 'We tune hyper-parameters to maximize performance on the validation set for each of the ID intent classification tasks.' However, it does not provide the specific values for these hyperparameters or other system-level training settings in the main text.
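The Mahalanobis-distance approach summarized in the abstract scores an utterance by its minimum squared Mahalanobis distance to any in-domain class centroid, using a covariance shared across classes. The authors did not release code, so the sketch below is my own minimal NumPy reconstruction of that standard formulation; function names and implementation details are assumptions, not the paper's implementation.

```python
import numpy as np

def fit_mahalanobis(features, labels):
    """Fit per-class means and a shared precision matrix from ID features.

    features: (n, d) array of encoder representations (e.g. [CLS] vectors).
    labels:   (n,) array of in-domain intent labels.
    """
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    # Shared (tied) covariance: pool class-centered features.
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(features)
    precision = np.linalg.pinv(cov)  # pseudo-inverse for numerical safety
    return means, precision

def ood_score(x, means, precision):
    """OOD score: min squared Mahalanobis distance to any class centroid.

    Higher score -> more likely out-of-domain.
    """
    return min((x - mu) @ precision @ (x - mu) for mu in means.values())
```

At evaluation time, scores over a mixed ID/OOD test set would be fed to a threshold-free metric such as AUROC, matching the 1-5% AUROC margins reported in the abstract.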