Generalized test utilities for long-tail performance in extreme multi-label classification

Authors: Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically test the introduced framework, we use popular benchmarks from the XMLC repository [6]. (The full experimental protocol is quoted under Experiment Setup below.)
Researcher Affiliation | Collaboration | Erik Schultheis (Aalto University, Helsinki, Finland, erik.schultheis@aalto.fi); Marek Wydmuch (Poznan University of Technology, Poznan, Poland, mwydmuch@cs.put.poznan.pl); Wojciech Kotłowski (Poznan University of Technology, Poznan, Poland, wkotlowski@cs.put.poznan.pl); Rohit Babbar (University of Bath / Aalto University, Bath, UK / Helsinki, Finland, rb2608@bath.ac.uk); Krzysztof Dembczyński (Yahoo! Research / Poznan University of Technology, New York, USA / Poznan, Poland, krzysztof.dembczynski@yahooinc.com)
Pseudocode | Yes | Algorithm 1: BCA(X, η̂, k, ϵ) (an illustrative block coordinate ascent sketch follows the table)
Open Source Code | Yes | Code to reproduce all the experiments: https://github.com/mwydmuch/xCOLUMNs
Open Datasets | Yes | To empirically test the introduced framework, we use popular benchmarks from the XMLC repository [6].
Dataset Splits | No | The paper mentions using 'training sets' and 'test instances' for its experiments, and refers to a 'validation set' only in the context of other frameworks ('The threshold tuning for PU is usually performed on a validation set'). It does not, however, provide specific train/validation/test splits (e.g., percentages or instance counts) for the datasets used in its own experimental evaluation.
Hardware Specification | Yes | The LIGHTXML model was trained on a workstation with a single Nvidia Tesla V100 GPU with 32 GB of memory and 64 GB of RAM. All the inference strategies were then run on the workstation with 64 GB of RAM.
Software Dependencies | No | The paper states: 'Please note that we implemented our algorithms in Python with some parts optimized using Numba [24] LLVM-based just-in-time (JIT) compiler for Python.' However, it does not specify version numbers for Python, Numba, or any other software dependency. (A minimal Numba JIT example follows the table.)
Experiment Setup | Yes | We train the LIGHTXML [18] model (with suggested default hyper-parameters) on provided training sets to obtain η̂ for all test instances. We then plug these estimates into different inference strategies and report the results across the discussed measures. To run the optimization algorithm efficiently, we use k = 100 or k = 1000 to pre-select for each instance the top k labels with the highest η̂_j, as described in Section 6.3. (See the pre-selection sketch following the table.)
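
To make the pre-selection step from the Experiment Setup row concrete, the sketch below keeps only the k = 100 (or k = 1000) labels with the highest η̂_j per instance before any inference strategy runs. This is an illustration, not the authors' code: the function name preselect_topk and the dense NumPy layout are assumptions, and the actual implementation in xCOLUMNs may differ.

```python
import numpy as np

def preselect_topk(eta_hat, k=100):
    """Keep, for each instance, the k labels with the highest estimated
    probability eta_hat[i, j]; downstream inference then only touches
    this small candidate set. Assumes k < number of labels."""
    # argpartition is O(m) per row, cheaper than fully sorting all labels
    cand = np.argpartition(-eta_hat, k, axis=1)[:, :k]
    cand_scores = np.take_along_axis(eta_hat, cand, axis=1)
    return cand, cand_scores  # (n, k) label ids and their probabilities
```

Here eta_hat would be the (n_test × m) matrix of probability estimates produced by the trained LIGHTXML model.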
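The Pseudocode row refers to Algorithm 1, BCA(X, η̂, k, ϵ), a block coordinate ascent over per-instance predictions. Below is a hedged sketch of the general shape of such a loop, using expected macro-precision as a simple stand-in utility; the paper's algorithm optimizes its generalized test utilities, and everything here (bca_sketch, the gain formula, the stopping rule) is illustrative rather than the authors' implementation.

```python
import numpy as np

def bca_sketch(eta_hat, k, eps=1e-6, max_iters=100):
    """Illustrative block coordinate ascent over per-instance predictions.

    eta_hat: (n, m) dense matrix of estimated label probabilities
             (real XMLC code would use the pre-selected sparse candidates).
    k:       per-instance prediction budget; assumes k < m.
    Uses expected macro-precision as a stand-in utility."""
    n, m = eta_hat.shape

    # Initialize with the plain top-k prediction for every instance.
    pred = np.zeros((n, m), dtype=bool)
    top = np.argpartition(-eta_hat, k, axis=1)[:, :k]
    pred[np.arange(n)[:, None], top] = True

    # Label-wise aggregates: expected true positives and predicted counts.
    tp = (pred * eta_hat).sum(axis=0)
    p = pred.sum(axis=0)

    prev_util = -np.inf
    for _ in range(max_iters):
        for i in range(n):
            # Remove instance i's block, then greedily re-pick its labels
            # while the other instances' predictions stay fixed.
            tp -= pred[i] * eta_hat[i]
            p -= pred[i]
            # Marginal gain of predicting label j for instance i under
            # expected macro-precision: new per-label ratio minus old one.
            gain = (tp + eta_hat[i]) / (p + 1) - tp / np.maximum(p, 1)
            new_row = np.zeros(m, dtype=bool)
            new_row[np.argpartition(-gain, k)[:k]] = True
            pred[i] = new_row
            tp += pred[i] * eta_hat[i]
            p += pred[i]
        util = (tp / np.maximum(p, 1)).mean()
        if util - prev_util < eps:  # epsilon-based stopping, as in BCA(..., eps)
            break
        prev_util = util
    return pred
```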
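The Software Dependencies row notes that hot paths were compiled with Numba's LLVM-based JIT, without pinning versions. The minimal example below shows the pattern (a decorated NumPy function compiled to native code); topk_indices is a hypothetical helper, not from the paper's codebase.

```python
import numpy as np
from numba import njit

@njit(cache=True)
def topk_indices(scores, k):
    """Indices of the k largest entries of a 1-D score vector.
    Full argsort then slice: simple, and Numba compiles the whole
    function to native code via LLVM."""
    order = np.argsort(scores)[::-1]  # descending by score
    return order[:k]

# Example use (the first call triggers JIT compilation):
# scores = np.random.rand(10_000); top = topk_indices(scores, 100)
```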