Dual-Encoders for Extreme Multi-label Classification
Authors: Nilesh Gupta, Fnu Devvrit, Ankit Singh Rawat, Srinadh Bhojanapalli, Prateek Jain, Inderjit S Dhillon
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we compare DE models trained with the proposed loss functions to XMC methods. We show that the proposed loss functions indeed substantially improve the performance of DE models, even improving over SOTA in some settings. We also present ablations validating choices such as loss functions and number of negatives, and further provide a more exhaustive analysis in Section C of the Appendix. |
| Researcher Affiliation | Collaboration | The University of Texas at Austin; Google; Google Research |
| Pseudocode | Yes | B.7 DISTRIBUTED LOG SOFT TOP-k IMPLEMENTATION: Below we provide the code snippet for an efficient distributed implementation of the log Soft Top-k operator (for loss computation it is more desirable to work directly in the log domain) in PyTorch. Here it is assumed that xs (the input to our operator) is of shape B × L/G, where B is the batch size, L is the total number of labels, and G is the number of GPUs used in the distributed setup. (An illustrative sketch of this sharded reduction pattern follows the table.) |
| Open Source Code | No | Our PyTorch based implementation is available at the following link. We plan to open-source the code after the review period along with the trained models to facilitate further research in these directions. |
| Open Datasets | Yes | We follow standard setup guidelines from the XMC repository (Bhatia et al., 2016) for LF-* datasets. For EURLex-4K, we adopt the XR-Transformer setup due to unavailable raw texts in the XMC repository version. Dataset statistics are in Table 6, and a summary is in Table 5. |
| Dataset Splits | Yes | Table 6: Dataset statistics (Num Train Points / Num Test Points / Num Labels / Avg Labels per Point / Avg Points per Label), e.g. LF-AmazonTitles-1.3M: 2,248,619 / 970,237 / 1,305,265 / 22.20 / 38.24. |
| Hardware Specification | Yes | We run our experiments on a setup of at most 16 A100 GPUs, each having 40 GB of GPU memory. |
| Software Dependencies | No | The paper mentions a "PyTorch based implementation" and uses `torch` in a code snippet, but it does not specify version numbers for PyTorch or any other software libraries or dependencies. |
| Experiment Setup | Yes | B.2 TRAINING HYPERPARAMETERS: In all of our experiments we train the dual encoders for 100 epochs with the AdamW optimizer and a linear-decay-with-warmup learning-rate schedule. Following standard practice, we use 0 weight decay for all non-gain parameters (such as layernorm and bias parameters) and 0.01 weight decay for the rest of the model parameters. The remaining hyperparameters considered in our experiments are: max_len: maximum input length to the transformer encoder; as in (Dahiya et al., 2023a) we use 128 for the long-text datasets (EURLex-4K and LF-Wikipedia-500K) and 32 for the short-text datasets (LF-AmazonTitles-131K and LF-AmazonTitles-1.3M). LR: learning rate of the model. batch_size: size of the mini-batches used during training. dropout: dropout applied to the dual-encoder embeddings during training. α: multiplicative factor controlling the steepness of the σ function described in Section 4.3. η: micro-batch size controlling how many labels are processed at a time when using gradient caching. τ: temperature used to scale similarity values (i.e., s(q_i, d_j)) during loss computation. Values are listed in Table 7: Hyperparameters. (An illustrative optimizer-setup sketch follows the table.) |
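The Appendix B.7 excerpt above describes a distributed log Soft Top-k operator over label scores sharded as B × L/G per GPU. The snippet below is only a minimal sketch of that sharding pattern under stated assumptions, not the paper's operator: it shows how a numerically stable global log-sum-exp can be combined across label shards with `torch.distributed`; the function name `sharded_logsumexp` and the assumption of an already-initialized process group are ours.

```python
import torch
import torch.distributed as dist

def sharded_logsumexp(xs: torch.Tensor) -> torch.Tensor:
    """Global log-sum-exp over label scores sharded across GPUs.

    Assumes `dist.init_process_group` has already been called and that
    `xs` has shape (B, L // G) on each of the G ranks, as in Appendix B.7.
    """
    # Per-rank max for numerical stability, then the global per-query max.
    local_max = xs.max(dim=1, keepdim=True).values        # (B, 1)
    global_max = local_max.clone()
    dist.all_reduce(global_max, op=dist.ReduceOp.MAX)      # same value on every rank
    # Sum of exponentials over the local label shard, then the global sum.
    local_sum = torch.exp(xs - global_max).sum(dim=1, keepdim=True)  # (B, 1)
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    return global_max + torch.log(local_sum)                # (B, 1), identical on all ranks
```

The two-pass all_reduce (global max, then sum of exponentials) is the standard device-sharded trick that makes working directly in the log domain feasible when no single GPU holds all L label scores.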
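The B.2 excerpt above (AdamW, linear decay with warmup, 0.01 weight decay except for layernorm/bias parameters) corresponds to a common PyTorch parameter-grouping recipe. The sketch below is an assumption-laden illustration, not the authors' code: the helper name `build_optimizer`, the substring-based parameter matching, and the use of `transformers.get_linear_schedule_with_warmup` are ours; the learning rate, warmup steps, and total steps would come from the paper's Table 7.

```python
import torch
from transformers import get_linear_schedule_with_warmup  # assumed dependency, not stated in the paper

def build_optimizer(model: torch.nn.Module, lr: float, warmup_steps: int, total_steps: int):
    # "Non-gain" parameters (bias, layer norm) get 0 weight decay per B.2;
    # the substring filters below are an assumption about parameter naming.
    no_decay = ("bias", "LayerNorm", "layer_norm")
    grouped_params = [
        {"params": [p for n, p in model.named_parameters()
                    if not any(k in n for k in no_decay)], "weight_decay": 0.01},
        {"params": [p for n, p in model.named_parameters()
                    if any(k in n for k in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = torch.optim.AdamW(grouped_params, lr=lr)
    # Linear decay with warmup, as described in B.2.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler
```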