Generalized test utilities for long-tail performance in extreme multi-label classification

Authors: Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically test the introduced framework, we use popular benchmarks from the XMLC repository [6]. (The full experimental protocol is quoted under Experiment Setup below.)
Researcher Affiliation | Collaboration | Erik Schultheis (Aalto University, Helsinki, Finland, erik.schultheis@aalto.fi); Marek Wydmuch (Poznan University of Technology, Poznan, Poland, mwydmuch@cs.put.poznan.pl); Wojciech Kotłowski (Poznan University of Technology, Poznan, Poland, wkotlowski@cs.put.poznan.pl); Rohit Babbar (University of Bath / Aalto University, Bath, UK / Helsinki, Finland, rb2608@bath.ac.uk); Krzysztof Dembczyński (Yahoo! Research / Poznan University of Technology, New York, USA / Poznan, Poland, krzysztof.dembczynski@yahooinc.com)
Pseudocode | Yes | Algorithm 1: BCA(X, η̂, k, ϵ) (an illustrative block coordinate ascent sketch follows the table)
Open Source Code | Yes | Code to reproduce all the experiments: https://github.com/mwydmuch/xCOLUMNs
Open Datasets | Yes | To empirically test the introduced framework, we use popular benchmarks from the XMLC repository [6].
Dataset Splits | No | The paper mentions using 'training sets' and 'test instances' for its experiments, and refers to a 'validation set' only in the context of other frameworks ('The threshold tuning for PU is usually performed on a validation set'). It does not, however, provide specific train/validation/test splits (e.g., percentages or instance counts) for the datasets used in its own experimental evaluation.
Hardware Specification | Yes | The LIGHTXML model was trained on a workstation with a single Nvidia Tesla V100 GPU with 32 GB of memory and 64 GB of RAM. All the inference strategies were then run on the workstation with 64 GB of RAM.
Software Dependencies | No | The paper states: 'Please note that we implemented our algorithms in Python with some parts optimized using Numba [24] LLVM-based just-in-time (JIT) compiler for Python.' However, it does not specify version numbers for Python, Numba, or any other software dependency. (A minimal Numba JIT example follows the table.)
Experiment Setup | Yes | We train the LIGHTXML [18] model (with suggested default hyper-parameters) on provided training sets to obtain η̂ for all test instances. We then plug these estimates into different inference strategies and report the results across the discussed measures. To run the optimization algorithm efficiently, we use k = 100 or k = 1000 to pre-select for each instance the top k labels with the highest η̂_j, as described in Section 6.3. (See the pre-selection sketch following the table.)
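
To make the pre-selection step from the Experiment Setup row concrete, the sketch below keeps only the k = 100 (or k = 1000) labels with the highest η̂_j per instance before any inference strategy runs. This is an illustration, not the authors' code: the function name preselect_topk and the dense NumPy layout are assumptions, and the actual implementation in xCOLUMNs may differ.

```python
import numpy as np

def preselect_topk(eta_hat, k=100):
    """Keep, for each instance, the k labels with the highest estimated
    probability eta_hat[i, j]; downstream inference then only touches
    this small candidate set. Assumes k < number of labels."""
    # argpartition is O(m) per row, cheaper than fully sorting all labels
    cand = np.argpartition(-eta_hat, k, axis=1)[:, :k]
    cand_scores = np.take_along_axis(eta_hat, cand, axis=1)
    return cand, cand_scores  # (n, k) label ids and their probabilities
```

Here eta_hat would be the (n_test × m) matrix of probability estimates produced by the trained LIGHTXML model.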
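The Pseudocode row refers to Algorithm 1, BCA(X, η̂, k, ϵ), a block coordinate ascent over per-instance predictions. Below is a hedged sketch of the general shape of such a loop, using expected macro-precision as a simple stand-in utility; the paper's algorithm optimizes its generalized test utilities, and everything here (bca_sketch, the gain formula, the stopping rule) is illustrative rather than the authors' implementation.

```python
import numpy as np

def bca_sketch(eta_hat, k, eps=1e-6, max_iters=100):
    """Illustrative block coordinate ascent over per-instance predictions.

    eta_hat: (n, m) dense matrix of estimated label probabilities
             (real XMLC code would use the pre-selected sparse candidates).
    k:       per-instance prediction budget; assumes k < m.
    Uses expected macro-precision as a stand-in utility."""
    n, m = eta_hat.shape

    # Initialize with the plain top-k prediction for every instance.
    pred = np.zeros((n, m), dtype=bool)
    top = np.argpartition(-eta_hat, k, axis=1)[:, :k]
    pred[np.arange(n)[:, None], top] = True

    # Label-wise aggregates: expected true positives and predicted counts.
    tp = (pred * eta_hat).sum(axis=0)
    p = pred.sum(axis=0)

    prev_util = -np.inf
    for _ in range(max_iters):
        for i in range(n):
            # Remove instance i's block, then greedily re-pick its labels
            # while the other instances' predictions stay fixed.
            tp -= pred[i] * eta_hat[i]
            p -= pred[i]
            # Marginal gain of predicting label j for instance i under
            # expected macro-precision: new per-label ratio minus old one.
            gain = (tp + eta_hat[i]) / (p + 1) - tp / np.maximum(p, 1)
            new_row = np.zeros(m, dtype=bool)
            new_row[np.argpartition(-gain, k)[:k]] = True
            pred[i] = new_row
            tp += pred[i] * eta_hat[i]
            p += pred[i]
        util = (tp / np.maximum(p, 1)).mean()
        if util - prev_util < eps:  # epsilon-based stopping, as in BCA(..., eps)
            break
        prev_util = util
    return pred
```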
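The Software Dependencies row notes that hot paths were compiled with Numba's LLVM-based JIT, without pinning versions. The minimal example below shows the pattern (a decorated NumPy function compiled to native code); topk_indices is a hypothetical helper, not from the paper's codebase.

```python
import numpy as np
from numba import njit

@njit(cache=True)
def topk_indices(scores, k):
    """Indices of the k largest entries of a 1-D score vector.
    Full argsort then slice: simple, and Numba compiles the whole
    function to native code via LLVM."""
    order = np.argsort(scores)[::-1]  # descending by score
    return order[:k]

# Example use (the first call triggers JIT compilation):
# scores = np.random.rand(10_000); top = topk_indices(scores, 100)
```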