Consistent algorithms for multi-label classification with macro-at-$k$ metrics
Authors: Erik Schultheis, Wojciech Kotłowski, Marek Wydmuch, Rohit Babbar, Strom Borman, Krzysztof Dembczyński
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results provide evidence for the competitive performance of the proposed approach. ... In this section, we empirically evaluate the proposed Frank-Wolfe algorithm on a variety of multi-label benchmark tasks... |
| Researcher Affiliation | Collaboration | Erik Schultheis (Aalto University, Helsinki, Finland, erik.schultheis@aalto.fi); Wojciech Kotłowski (Poznan University of Technology, Poznan, Poland, wkotlowski@cs.put.poznan.pl); Marek Wydmuch (Poznan University of Technology, Poznan, Poland, mwydmuch@cs.put.poznan.pl); Rohit Babbar (University of Bath / Aalto University, Bath, UK / Helsinki, Finland, rb2608@bath.ac.uk); Strom Borman (Yahoo Research, Champaign, USA, strom.borman@yahooinc.com); Krzysztof Dembczyński (Yahoo Research / Poznan University of Technology, New York, USA / Poznan, Poland, krzysztof.dembczynski@yahooinc.com) |
| Pseudocode | Yes | Algorithm 1 Multi-label Frank-Wolfe algorithm for complex performance measures |
| Open Source Code | Yes | Code to reproduce the experiments: https://github.com/mwydmuch/xCOLUMNs |
| Open Datasets | Yes | In this section, we empirically evaluate the proposed Frank-Wolfe algorithm on a variety of multi-label benchmark tasks that differ substantially in the number of labels and imbalance of the label distribution: MEDIAMILL (Snoek et al., 2006), FLICKR (Tang & Liu, 2009), RCV1X (Lewis et al., 2004), and AMAZONCAT (McAuley & Leskovec, 2013; Bhatia et al., 2016). |
| Dataset Splits | Yes | In the beginning, we split the available training data into two subsets. One for estimating label probabilities $\hat{\eta}$, and one for tuning the actual classifier. ... we tested different ratios (50/50 or 75/25) of splitting training data into sets used for training the label probability estimators and estimating confusion matrix $C$, as well as a variant where we use the whole training set for both steps. ... Table 2: Results of different inference strategies on measures calculated at $k \in \{3, 5, 10\}$. Notation: P precision, R recall, F1 F1-measure. ... MEDIAMILL ($m = 101$, $n_\text{train} = 30993$, $n_\text{test} = 12914$, ...) |
| Hardware Specification | Yes | All the experiments were conducted on a workstation with 64 GB of RAM and an Nvidia V100 16GB GPU. |
| Software Dependencies | No | The paper mentions 'implemented in Pytorch Paszke et al. (2019)' and 'We use Adam optimizer (Kingma & Ba, 2015)' but does not provide specific version numbers for PyTorch or other software dependencies, only the publication years of the papers describing them. |
| Experiment Setup | Yes | For the first three datasets we use a multi-layer neural network for estimating $\hat{\eta}(x)$. For the last and largest dataset, we use a sparse linear label tree model... We use $k = 200$ for both RCV1X and AMAZONCAT datasets. ... We tested different ratios (50/50 or 75/25) of splitting training data... We also investigated two strategies for initialization of classifier $h$ by either using equal weights (resulting in a TOP-K classifier) or random weights. Finally, we terminate the algorithm if we do not observe sufficient improvement in the objective. In practice, we found that Frank-Wolfe converges within 3-10 iterations. |
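As a rough illustration of the two-stage protocol quoted in the Dataset Splits row above (one part of the training data is used to fit the label-probability estimator $\hat{\eta}$, the rest to estimate the confusion matrix used for tuning the classifier), the sketch below shows such a split in Python. The function name `split_training_data` and the arrays `X_train` / `Y_train` are illustrative placeholders, not part of the authors' xCOLUMNs code.

```python
import numpy as np

def split_training_data(X, Y, estimator_fraction=0.5, seed=0):
    """Randomly partition (X, Y) into an estimator set and a tuning/confusion set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    cut = int(estimator_fraction * len(idx))
    est, tune = idx[:cut], idx[cut:]
    return (X[est], Y[est]), (X[tune], Y[tune])

# 75/25 variant; the paper also reports a 50/50 split and a variant that
# reuses the whole training set for both steps. X_train / Y_train are placeholders.
(X_est, Y_est), (X_tune, Y_tune) = split_training_data(X_train, Y_train, estimator_fraction=0.75)
```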
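The Pseudocode and Experiment Setup rows refer to Algorithm 1, a Frank-Wolfe loop over classifiers evaluated through their confusion matrices, initialized with equal weights (a plain TOP-K classifier) and terminated when the objective stops improving sufficiently. The following is only a schematic sketch of such a loop, not the authors' Algorithm 1; the helper callables (`utility`, `utility_grad`, `top_k_oracle`, `confusion`) are assumptions supplied by the caller rather than the xCOLUMNs API.

```python
import numpy as np

def frank_wolfe(eta, Y, utility, utility_grad, top_k_oracle, confusion,
                max_iters=10, tol=1e-6):
    """eta: estimated label probabilities on the tuning set; Y: true label matrix."""
    # Equal-weight initialization corresponds to a plain TOP-K classifier.
    preds = top_k_oracle(eta, np.ones(eta.shape[1]))
    C = confusion(Y, preds)
    value = utility(C)
    for t in range(1, max_iters):
        G = utility_grad(C)                 # linearize the metric at the current C
        preds = top_k_oracle(eta, G)        # best (cost-sensitive top-k) response
        step = 2.0 / (t + 2.0)              # classic Frank-Wolfe step size
        C = (1.0 - step) * C + step * confusion(Y, preds)
        new_value = utility(C)
        if new_value - value < tol:         # stop on insufficient improvement
            return C, new_value
        value = new_value
    return C, value
```

Unlike this sketch, which only tracks the mixed confusion matrix, the paper's algorithm keeps the classifiers found along the way and combines them with the Frank-Wolfe mixing weights into a randomized classifier; the default `max_iters=10` is chosen here to match the reported convergence within 3-10 iterations.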