Good Classification Measures and How to Find Them
Authors: Martijn Gösgens, Anton Zhiyanov, Aleksey Tikhonov, Liudmila Prokhorenkova
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we conduct a systematic analysis of classification performance measures: we formally define a list of desirable properties and theoretically analyze which measures satisfy which properties. We also prove an impossibility theorem: some desirable properties cannot be simultaneously satisfied. Finally, we propose a new family of measures satisfying all desirable properties except one. This family includes the Matthews Correlation Coefficient and a so-called Symmetric Balanced Accuracy that was not previously used in classification literature. We believe that our systematic approach gives an important tool to practitioners for adequately evaluating classification results. We also demonstrate through a series of experiments that different performance measures can be inconsistent in various situations. |
| Researcher Affiliation | Collaboration | Martijn Gösgens, Eindhoven University of Technology, Eindhoven, The Netherlands (research@martijngosgens.nl); Anton Zhiyanov, Yandex Research, HSE University, Moscow, Russia (zhiyanovap@gmail.com); Alexey Tikhonov, Yandex, Berlin, Germany (altsoph@gmail.com); Liudmila Prokhorenkova, Yandex Research, HSE University, MIPT, Moscow, Russia (ostroumova-la@yandex.ru) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for our experiments can be found on GitHub (footnote 5: https://github.com/yandex-research/classification-measures) |
| Open Datasets | Yes | ImageNet [24], a classic dataset for image classification. ... take the 5-class Stanford Sentiment Treebank (SST-5) dataset [27]. |
| Dataset Splits | No | The paper mentions using a "test set" for evaluation, but does not specify split sizes or describe a validation set, which limits reproducibility. For example, Section 5.2 (Image classification) states only: "apply the models to the test set". |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running the experiments. It only mentions general setups like "a model that predicts the presence/absence of precipitation" or ImageNet experiments without hardware specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | No | The paper mentions "six thresholds used in this experiment" for the weather forecasting service but lacks comprehensive details on hyperparameters, optimizer settings, or other system-level training configurations for the models used in the various experiments (e.g., ImageNet, SST-5). |
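The paper's proposed family of measures includes the Matthews Correlation Coefficient (MCC). As a minimal illustrative sketch (not the authors' released code, which lives in the linked GitHub repository), MCC for binary classification can be computed directly from confusion-matrix counts:

```python
import math

def matthews_corrcoef(tp: int, fp: int, fn: int, tn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal is zero, matching the common convention
    (e.g., scikit-learn) for the degenerate case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return numerator / denominator if denominator > 0 else 0.0

# A perfect classifier scores 1.0; a perfectly inverted one scores -1.0.
print(matthews_corrcoef(tp=5, fp=0, fn=0, tn=5))   # 1.0
print(matthews_corrcoef(tp=0, fp=5, fn=5, tn=0))   # -1.0
```

MCC is symmetric between the positive and negative class, one of the desirable properties analyzed in the paper; a measure like plain accuracy does not change under class relabeling either, but precision-based measures do.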