Navigating Extremes: Dynamic Sparsity in Large Output Spaces

Authors: Nasibullah Nasibullah, Erik Schultheis, Mike Lasby, Yani Ioannou, Rohit Babbar

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Section 3, "Experiments and discussion" |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Aalto University, Helsinki, Finland {nasibullah.nasibullah, erik.schultheis, rohit.babbar}@aalto.fi (2) Schulich School of Engineering, University of Calgary, Calgary, AB, Canada {mklasby, yani.ioannou}@ucalgary.ca (3) Department of Computer Science, University of Bath, Bath, UK rb2608@bath.ac.uk |
| Pseudocode | No | The paper describes methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/xmc-aalto/NeurIPS24-dst |
| Open Datasets | Yes | The datasets are publicly available at the Extreme Classification Repository: http://manikvarma.org/downloads/XC/XMLRepository.html |
| Dataset Splits | No | Table 1 provides the total number of training instances (N) and test instances (N'), but no explicit details for a validation set split are provided in the paper. |
| Hardware Specification | Yes | While we want to demonstrate the memory efficiency of our algorithms, in order to enable meaningful comparison with existing methods, we run all our experiments on an NVIDIA A100 GPU, and measure the memory consumption using torch.cuda.max_memory_allocated. (A measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions software such as PyTorch, CUDA kernels, and `torch.amp`, but does not provide specific version numbers for these or other ancillary software components. (A mixed-precision sketch follows the table.) |
| Experiment Setup | Yes | We present the hyperparameter settings used during training in Table 8. For the encoder and classifier, we employ two separate optimizers: AdamW for both components, except in the case of LF-AmazonTitles-131K, where Adam and SGD are utilized. All experiments are conducted using half-precision float16 types, except for Amazon-3M and LF-AmazonTitles-131K, which use the bfloat16 type. We apply a cosine scheduler with warmup, as specified in the table. The weight decay values are set separately: 0.01 for the encoder and 1.0e-4 for the final classification layer. We use the squared hinge loss function for all datasets except for LF-AmazonTitles-131K, where we use binary cross-entropy (BCE) loss with positive labels. Table 9: DST and other related hyperparameter settings for different datasets. (A configuration sketch follows the table.) |
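
The Hardware Specification entry above quotes the paper's use of `torch.cuda.max_memory_allocated` to report GPU memory consumption. Below is a minimal sketch of that measurement pattern; the model, batch, and dimensions are hypothetical placeholders rather than the paper's actual encoder or data.

```python
import torch

# Hypothetical stand-ins for the model and a batch; sizes are illustrative only.
model = torch.nn.Linear(768, 131_072).cuda()
batch = torch.randn(128, 768, device="cuda")

torch.cuda.reset_peak_memory_stats()   # clear the running peak counter

loss = model(batch).square().mean()    # stand-in forward pass
loss.backward()                        # backward pass allocates gradients
torch.cuda.synchronize()

peak_bytes = torch.cuda.max_memory_allocated()
print(f"Peak GPU memory: {peak_bytes / 2**30:.2f} GiB")
```

Resetting the peak statistics immediately before the measured region makes the reported number reflect that region alone rather than any earlier allocations.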
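
The Software Dependencies and Experiment Setup entries mention `torch.amp` and training in float16 or bfloat16 (bfloat16 for Amazon-3M and LF-AmazonTitles-131K). The following is a hedged sketch of a standard PyTorch mixed-precision training step under those dtypes; it is not taken from the released code, and the module, data, and loss are dummies.

```python
import torch

model = torch.nn.Linear(768, 1000).cuda()           # dummy module
optimizer = torch.optim.AdamW(model.parameters())

amp_dtype = torch.float16                            # or torch.bfloat16
# Gradient scaling is only needed for float16; bfloat16 has enough dynamic range.
scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))

x = torch.randn(32, 768, device="cuda")
y = torch.randn(32, 1000, device="cuda")

with torch.autocast(device_type="cuda", dtype=amp_dtype):
    loss = torch.nn.functional.mse_loss(model(x), y)  # dummy loss

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```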
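
The Experiment Setup entry also describes two separate optimizers with different weight decays (0.01 for the encoder, 1.0e-4 for the classifier), a cosine schedule with warmup, and a squared hinge loss. A minimal sketch of how such a configuration might be wired up in PyTorch follows; the learning rates, warmup and total step counts, and layer sizes are hypothetical, and only the weight-decay values and the loss form follow the quoted setup.

```python
import math
import torch

# Hypothetical encoder/classifier modules; sizes and learning rates are illustrative.
encoder = torch.nn.Linear(768, 768)
classifier = torch.nn.Linear(768, 131_072)

# Separate optimizers with the weight decays quoted above (0.01 vs. 1.0e-4).
enc_opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4, weight_decay=0.01)
clf_opt = torch.optim.AdamW(classifier.parameters(), lr=1e-3, weight_decay=1e-4)

def cosine_with_warmup(step, warmup_steps=1_000, total_steps=100_000):
    """LR multiplier: linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

enc_sched = torch.optim.lr_scheduler.LambdaLR(enc_opt, cosine_with_warmup)
clf_sched = torch.optim.lr_scheduler.LambdaLR(clf_opt, cosine_with_warmup)

def squared_hinge_loss(scores, targets):
    """Squared hinge loss with +/-1 targets and unit margin."""
    return torch.clamp(1.0 - targets * scores, min=0.0).pow(2).mean()
```

For LF-AmazonTitles-131K the quoted setup swaps the squared hinge loss for BCE and uses Adam and SGD rather than AdamW; that variant is omitted here for brevity.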