Good Classification Measures and How to Find Them
Authors: Martijn Gösgens, Anton Zhiyanov, Aleksey Tikhonov, Liudmila Prokhorenkova
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we conduct a systematic analysis of classification performance measures: we formally define a list of desirable properties and theoretically analyze which measures satisfy which properties. We also prove an impossibility theorem: some desirable properties cannot be simultaneously satisfied. Finally, we propose a new family of measures satisfying all desirable properties except one. This family includes the Matthews Correlation Coefficient and a so-called Symmetric Balanced Accuracy that was not previously used in classification literature. We believe that our systematic approach gives an important tool to practitioners for adequately evaluating classification results. We also demonstrate through a series of experiments that different performance measures can be inconsistent in various situations. |
| Researcher Affiliation | Collaboration | Martijn Gösgens, Eindhoven University of Technology, Eindhoven, The Netherlands (research@martijngosgens.nl); Anton Zhiyanov, Yandex Research, HSE University, Moscow, Russia (zhiyanovap@gmail.com); Alexey Tikhonov, Yandex, Berlin, Germany (altsoph@gmail.com); Liudmila Prokhorenkova, Yandex Research, HSE University, MIPT, Moscow, Russia (ostroumova-la@yandex.ru) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments can be found on GitHub (footnote 5: https://github.com/yandex-research/classification-measures) |
| Open Datasets | Yes | ImageNet [24], a classic dataset for image classification. ... take the 5-class Stanford Sentiment Treebank (SST-5) dataset [27]. |
| Dataset Splits | No | The paper mentions using a "test set" for evaluation, but does not specify split sizes or describe a validation set, which limits reproducibility. For example, Section 5.2 (Image classification) states only: "apply the models to the test set". |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running the experiments. It only mentions general setups like "a model that predicts the presence/absence of precipitation" or ImageNet experiments without hardware specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | No | The paper mentions "six thresholds used in this experiment" for the weather forecasting service but lacks comprehensive details on hyperparameters, optimizer settings, or other system-level training configurations for the models used in the various experiments (e.g., ImageNet, SST-5). |
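The paper's proposed family of measures includes the Matthews Correlation Coefficient (MCC). As a minimal illustrative sketch (not the authors' released code, which lives in the linked GitHub repository), MCC for binary classification can be computed directly from confusion-matrix counts:

```python
import math

def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal is zero, matching the common convention
    (e.g., scikit-learn) for the degenerate case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return numerator / denominator if denominator > 0 else 0.0

# A perfect classifier scores 1.0; a perfectly inverted one scores -1.0.
print(matthews_corrcoef(tp=5, fp=0, fn=0, tn=5))   # 1.0
print(matthews_corrcoef(tp=0, fp=5, fn=5, tn=0))   # -1.0
```

MCC is symmetric between the positive and negative class, one of the desirable properties analyzed in the paper; a measure like plain accuracy does not change under class relabeling either, but precision-based measures do.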