Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification

Authors: Jiong Zhang, Wei-Cheng Chang, Hsiang-Fu Yu, Inderjit Dhillon

NeurIPS 2021

Reproducibility variables: each entry below gives the variable, the assessed result, and the LLM response supporting that assessment.
Research Type: Experimental
LLM Response: "Empirical results show that XR-Transformer takes significantly less training time compared to other transformer-based XMC models while yielding better state-of-the-art results. In particular, on the public Amazon-3M dataset with 3 million labels, XR-Transformer is not only 20x faster than X-Transformer but also improves the Precision@1 from 51% to 54%."
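For reference, Precision@k (the metric quoted above) measures, for each test instance, what fraction of the model's k highest-scored labels are true labels, averaged over instances. A minimal sketch of that computation in Python (dense arrays for clarity; function and variable names are illustrative, not from the paper's codebase):

```python
import numpy as np

def precision_at_k(scores: np.ndarray, truth: np.ndarray, k: int = 1) -> float:
    """Precision@k for multi-label predictions.

    scores: (n_samples, n_labels) predicted relevance scores.
    truth:  (n_samples, n_labels) binary ground-truth label matrix.
    """
    # Indices of the k highest-scoring labels for each instance.
    topk = np.argsort(-scores, axis=1)[:, :k]
    # Count how many of those top-k predictions are actually true labels.
    hits = np.take_along_axis(truth, topk, axis=1).sum(axis=1)
    return float(np.mean(hits / k))
```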
Researcher Affiliation: Collaboration
LLM Response: Jiong Zhang (Amazon) jiongz@amazon.com; Wei-Cheng Chang (Amazon) chanweic@amazon.com; Hsiang-Fu Yu (Amazon) rofu.yu@gmail.com; Inderjit S. Dhillon (UT Austin & Amazon) inderjit@cs.utexas.edu
Pseudocode: Yes
LLM Response: "Algorithm 1: Iterative_Learn(X, Y, C, θ, P) ... Algorithm 2: XR-Transformer training"
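The cited pseudocode describes XR-Transformer's coarse-to-fine recipe: the transformer encoder is fine-tuned once per resolution of a hierarchical label tree, with each round warm-started from the previous one. A hedged sketch of that outer loop, assuming a precomputed `cluster_chain` and a `fine_tune` callable (all names here are illustrative, not the repository's API):

```python
import scipy.sparse as sp

def multi_resolution_train(encoder, Y: sp.csr_matrix, cluster_chain, fine_tune):
    """Coarse-to-fine fine-tuning loop in the shape of the paper's Algorithm 2.

    Y            : (n_samples, n_labels) binary CSR ground-truth matrix.
    cluster_chain: list of (n_labels, n_clusters) CSR matrices, coarse to fine.
    fine_tune    : callable(encoder, targets) -> encoder; stands in for one
                   round of transformer fine-tuning (hypothetical).
    """
    for C in cluster_chain:
        Y_t = Y.dot(C)                     # project labels onto this resolution's clusters
        Y_t.data[:] = 1.0                  # binarize the induced multi-label targets
        encoder = fine_tune(encoder, Y_t)  # warm-start from the previous resolution
    return encoder
```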
Open Source Code: Yes
LLM Response: "Our code is publicly available at https://github.com/amzn/pecos."
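For orientation, the released pecos library exposes the XMC building blocks used here (label embeddings, hierarchical label trees, linear rankers); the transformer trainer lives under `pecos.xmc.xtransformer`. A minimal sketch based on the API names in the repository's README at the time of writing (check the repo for the current interface; the PyPI package name `libpecos` is an assumption):

```python
# pip install libpecos   (assumed package name; see the repo README)
from pecos.xmc import Indexer, LabelEmbeddingFactory
from pecos.xmc.xlinear.model import XLinearModel

def train_and_predict(X, Y, X_test):
    """X: (n, d) CSR TF-IDF features; Y: (n, L) CSR binary label matrix."""
    label_feat = LabelEmbeddingFactory.create(Y, X, method="pifa")  # PIFA label embeddings
    cluster_chain = Indexer.gen(label_feat)   # hierarchical label tree via recursive clustering
    model = XLinearModel.train(X, Y, C=cluster_chain)
    return model.predict(X_test)              # sparse matrix of top label scores
```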
Open Datasets: Yes
LLM Response: "We evaluate XR-Transformer on 6 public XMC benchmarking datasets: Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K, Amazon-670K, Amazon-3M. ... These six publicly available benchmark datasets, including the sparse TF-IDF features, are downloaded from https://github.com/yourh/AttentionXML and are the same as those used by AttentionXML [8], X-Transformer [12], and LightXML [13] for fair comparison."
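The sparse TF-IDF features in these releases are stored as SciPy sparse matrices; a minimal loading sketch (the file names below are placeholders and should be matched to the downloaded archive's actual contents):

```python
import scipy.sparse as sp

# Placeholder file names; adjust to the archive's actual contents.
X_trn = sp.load_npz("X.trn.npz")  # (n_train, d) CSR TF-IDF features
Y_trn = sp.load_npz("Y.trn.npz")  # (n_train, L) CSR binary label matrix
print(X_trn.shape, Y_trn.shape)
print("avg labels per instance:", Y_trn.nnz / Y_trn.shape[0])
```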
Dataset Splits: Yes
LLM Response: "For fair comparison, we use the same raw text input, sparse feature representations, and train-test split as AttentionXML [8] and other recent works [12, 13]."
Hardware Specification: Yes
LLM Response: "All experiments are conducted with float32 precision on an AWS p3.16xlarge instance with 8 NVIDIA V100 GPUs."
Software Dependencies: No
LLM Response: No specific software dependencies with version numbers are mentioned for reproducibility.
Experiment Setup: Yes
LLM Response: "The hyper-parameters of XR-Transformer and more empirical results are included in Appendix A.3. The proposed XR-Transformer follows AttentionXML and LightXML in using an ensemble of 3 models, while X-Transformer uses an ensemble of 9 models [12]. More details about the ensemble setting can be found in Appendix A.3."
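A common way such XMC ensembles are combined is by averaging the per-label score matrices of the independently trained models before ranking; the paper's exact scheme is described in Appendix A.3. A minimal sketch under that averaging assumption (`models` and their `predict` method are illustrative):

```python
def ensemble_predict(models, X_test):
    """Uniformly average sparse (n_samples, n_labels) score matrices.

    Assumes each model.predict returns a SciPy sparse score matrix; uniform
    averaging is a common XMC combination rule, not necessarily the paper's
    exact one.
    """
    scores = [m.predict(X_test) for m in models]
    avg = scores[0]
    for s in scores[1:]:
        avg = avg + s              # element-wise sum of sparse score matrices
    return avg / len(models)       # uniform average
```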