Navigating Extremes: Dynamic Sparsity in Large Output Spaces
Authors: Nasibullah Nasibullah, Erik Schultheis, Mike Lasby, Yani Ioannou, Rohit Babbar
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3, "Experiments and discussion" |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Aalto University, Helsinki, Finland, {nasibullah.nasibullah, erik.schultheis, rohit.babbar}@aalto.fi; (2) Schulich School of Engineering, University of Calgary, Calgary, AB, Canada, {mklasby, yani.ioannou}@ucalgary.ca; (3) Department of Computer Science, University of Bath, Bath, UK, rb2608@bath.ac.uk |
| Pseudocode | No | The paper describes methods in prose but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/xmc-aalto/NeurIPS24-dst |
| Open Datasets | Yes | The datasets are publicly available at the Extreme Classification Repository: http://manikvarma.org/downloads/XC/XMLRepository.html |
| Dataset Splits | No | Table 1 provides the total number of training instances (N) and test instances (N'), but no explicit details for a validation set split are provided in the paper. |
| Hardware Specification | Yes | While we want to demonstrate the memory efficiency of our algorithms, in order to enable meaningful comparison with existing methods, we run all our experiments on an NVIDIA A100 GPU, and measure the memory consumption using torch.cuda.max_memory_allocated. |
| Software Dependencies | No | The paper mentions software like PyTorch, CUDA kernels, and `torch.amp` but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We present the hyperparameter settings used during training in Table 8. For the encoder and classifier, we employ two separate optimizers: AdamW for both components, except in the case of LF-AmazonTitles-131K, where Adam and SGD are utilized. All experiments are conducted using half-precision float16 types, except for Amazon-3M and LF-AmazonTitles-131K, which use the bfloat16 type. We apply a cosine scheduler with warmup, as specified in the table. The weight decay values are set separately: 0.01 for the encoder and 1.0e-4 for the final classification layer. We use the squared hinge loss function for all datasets except for LF-AmazonTitles-131K, where we use binary cross-entropy (BCE) loss with positive labels. Table 9: DST and other related hyperparameter settings for different datasets. |
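The Experiment Setup row above summarizes the paper's Table 8: separate AdamW optimizers for the encoder and the classification layer with weight decay 0.01 and 1.0e-4 respectively, a cosine schedule with warmup, half-precision (float16 or bfloat16) training, and a squared hinge loss for most datasets. The sketch below illustrates such a configuration in PyTorch under our own assumptions; the module sizes, learning rates, batch size, step counts, and label-sampling probability are illustrative placeholders rather than values from the paper.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

device = "cuda"

# Placeholder modules standing in for the text encoder and the final classification
# layer; the dimensions are illustrative, not taken from the paper.
encoder = torch.nn.Linear(768, 768).to(device)
classifier = torch.nn.Linear(768, 10_000).to(device)

# Two separate optimizers with the weight-decay values quoted above.
enc_opt = AdamW(encoder.parameters(), lr=1e-4, weight_decay=0.01)
clf_opt = AdamW(classifier.parameters(), lr=1e-3, weight_decay=1.0e-4)

def cosine_with_warmup(warmup_steps: int, total_steps: int):
    """Learning-rate multiplier: linear warmup followed by cosine decay to zero."""
    def multiplier(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return multiplier

enc_sched = LambdaLR(enc_opt, cosine_with_warmup(1_000, 100_000))
clf_sched = LambdaLR(clf_opt, cosine_with_warmup(1_000, 100_000))

def squared_hinge_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Squared hinge loss with targets in {-1, +1}: mean of max(0, 1 - y*s)^2."""
    return torch.clamp(1.0 - targets * scores, min=0.0).pow(2).mean()

# One mixed-precision training step (float16 here; the paper uses bfloat16 for two datasets).
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(32, 768, device=device)
pos = torch.rand(32, 10_000, device=device) < 0.001
targets = torch.where(pos, torch.tensor(1.0, device=device), torch.tensor(-1.0, device=device))

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = squared_hinge_loss(classifier(encoder(x)), targets)

scaler.scale(loss).backward()
scaler.step(enc_opt)
scaler.step(clf_opt)
scaler.update()
enc_sched.step()
clf_sched.step()
```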
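The Hardware Specification row notes that peak GPU memory is reported via `torch.cuda.max_memory_allocated`. The snippet below is a minimal sketch of that measurement pattern, not the authors' code; the function name `peak_memory_gib` and the `step_fn` placeholder are our own.

```python
import torch

def peak_memory_gib(step_fn, device: str = "cuda") -> float:
    """Run one training step and return the peak allocated GPU memory in GiB."""
    torch.cuda.reset_peak_memory_stats(device)  # clear the running peak counter
    step_fn()                                   # e.g. forward + backward + optimizer step
    return torch.cuda.max_memory_allocated(device) / 1024**3
```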