Generalized test utilities for long-tail performance in extreme multi-label classification
Authors: Erik Schultheis, Marek Wydmuch, Wojciech Kotlowski, Rohit Babbar, Krzysztof Dembczynski
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically test the introduced framework, we use popular benchmarks from the XMLC repository [6]. We train the LIGHTXML [18] model (with suggested default hyper-parameters) on provided training sets to obtain η̂ for all test instances. We then plug these estimates into different inference strategies and report the results across the discussed measures. To run the optimization algorithm efficiently, we use k = 100 or k = 1000 to pre-select for each instance the top k labels with the highest η̂_j, as described in Section 6.3. |
| Researcher Affiliation | Collaboration | Erik Schultheis, Aalto University, Helsinki, Finland, erik.schultheis@aalto.fi; Marek Wydmuch, Poznan University of Technology, Poznan, Poland, mwydmuch@cs.put.poznan.pl; Wojciech Kotłowski, Poznan University of Technology, Poznan, Poland, wkotlowski@cs.put.poznan.pl; Rohit Babbar, University of Bath / Aalto University, Bath, UK / Helsinki, Finland, rb2608@bath.ac.uk; Krzysztof Dembczyński, Yahoo! Research / Poznan University of Technology, New York, USA / Poznan, Poland, krzysztof.dembczynski@yahooinc.com |
| Pseudocode | Yes | Algorithm 1 BCA(X, η̂, k, ϵ) |
| Open Source Code | Yes | Code to reproduce all the experiments: https://github.com/mwydmuch/xCOLUMNs |
| Open Datasets | Yes | To empirically test the introduced framework, we use popular benchmarks from the XMLC repository [6]. |
| Dataset Splits | No | The paper mentions using 'training sets' and 'test instances' for its experiments, and refers to a 'validation set' in the context of other frameworks ('The threshold tuning for PU is usually performed on a validation set'). However, it does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the datasets used in its own experimental evaluation. |
| Hardware Specification | Yes | The LIGHTXML model was trained on a workstation with a single Nvidia Tesla V100 GPU with 32 GB of memory and 64 GB of RAM. All the inference strategies were then run on the workstation with 64 GB of RAM. |
| Software Dependencies | No | The paper states: 'Please note that we implemented our algorithms in Python with some parts optimized using Numba [24] LLVM-based just-in-time (JIT) compiler for Python.' However, it does not specify version numbers for Python, Numba, or any other software dependencies. |
| Experiment Setup | Yes | We train the LIGHTXML [18] model (with suggested default hyper-parameters) on provided training sets to obtain η̂ for all test instances. We then plug these estimates into different inference strategies and report the results across the discussed measures. To run the optimization algorithm efficiently, we use k = 100 or k = 1000 to pre-select for each instance the top k labels with the highest η̂_j, as described in Section 6.3. |
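The pre-selection step quoted above (keep, for each test instance, only the k labels with the highest estimated marginal probabilities η̂_j before running the inference strategies) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation (which the paper says uses Python with Numba-JIT-optimized parts); the function name `preselect_top_k` and the dense matrix shape are assumptions made here for clarity.

```python
import numpy as np

def preselect_top_k(eta_hat: np.ndarray, k: int) -> np.ndarray:
    """Per-row indices of the k largest entries of eta_hat.

    eta_hat: (n_instances, n_labels) matrix of estimated label
    probabilities; k: number of candidate labels to keep (the paper
    uses k = 100 or k = 1000).
    """
    k = min(k, eta_hat.shape[1])
    # argpartition gathers the top-k indices without a full sort ...
    top_k = np.argpartition(-eta_hat, k - 1, axis=1)[:, :k]
    # ... then only those k entries are sorted by descending probability.
    order = np.argsort(-np.take_along_axis(eta_hat, top_k, axis=1), axis=1)
    return np.take_along_axis(top_k, order, axis=1)

eta_hat = np.array([[0.1, 0.9, 0.3, 0.7],
                    [0.5, 0.2, 0.8, 0.1]])
print(preselect_top_k(eta_hat, k=2))  # prints [[1 3]
                                      #         [2 0]]
```

Restricting inference to these k candidates keeps the per-instance cost roughly O(n_labels + k log k) instead of scaling a full sort over the entire (potentially extreme) label space.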