Hierarchical Text Classification as Sub-hierarchy Sequence Generation
Authors: SangHun Im, GiBaeg Kim, Heung-Seon Oh, Seongung Jo, Dong Hwan Kim
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | HiDEC achieved state-of-the-art performance with significantly fewer model parameters than existing models on benchmark datasets such as RCV1-v2, NYT, and EURLEX57K. |
| Researcher Affiliation | Academia | School of Computer Science and Engineering, Korea University of Technology and Education (KOREATECH) {tkrhkshdqn, fk0214, ohhs, oowhat, hwan6615}@koreatech.ac.kr |
| Pseudocode | Yes | Algorithm 1: Recursive Hierarchy Decoding in Inference (a hedged decoding sketch follows this table) |
| Open Source Code | Yes | Code is available on https://github.com/SangHunIm/HiDEC |
| Open Datasets | Yes | For the standard evaluation, two small-scale datasets, RCV1-v2 (Lewis et al. 2004) and NYT (Sandhaus 2008), and one large-scale dataset, EURLEX57K (Chalkidis et al. 2019), were chosen. |
| Dataset Splits | Yes | RCV1-v2 comprises 804,414 news documents, divided into 23,149 and 781,265 documents for training and testing, respectively, as benchmark splits. We randomly sampled 10% of the training data as the validation data for model selection. NYT comprises 36,471 news documents divided into 29,179 and 7,292 documents for training and testing, respectively. For a fair comparison, we followed the data configurations of previous work (Zhou et al. 2020; Chen et al. 2021). In particular, EURLEX57K is a large-scale hierarchy with 57,000 documents and 4,271 labels. Benchmark splits of 45,000, 6,000, and 6,000 were used for training, validation, and testing, respectively. |
| Hardware Specification | Yes | All models were implemented using PyTorch (Paszke et al. 2019) and trained using NVIDIA A6000. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al. 2019)' for implementation but does not specify its version or other key software dependencies with their respective version numbers (e.g., Python version, CUDA version, other libraries). |
| Experiment Setup | Yes | The size of the hidden state was set to 300. The word embeddings in the text encoder were initialized using 300-dimensional GloVe (Pennington, Socher, and Manning 2014). For HiDEC, a layer with two heads was used for both the GRU-based encoder and BERT. The label and level embeddings, with 300 and 768 dimensions for the GRU-based encoder and BERT, respectively, were initialized using a normal distribution with µ=0 and σ=300^-0.5. The hidden state size in the attentive layer was the same as the label embedding size. The FFN comprised two FC layers with 600- and 3,072-dimensional feed-forward filters for the GRU-based encoder and BERT, respectively. The threshold for recursive hierarchy decoding was set to 0.5. Dropout with probabilities of 0.5, 0.1, and 0.1 was applied to the embedding layer, behind every FFN, and behind every attention layer, respectively. For optimization, the Adam optimizer (Kingma and Ba 2015) was used with learning rate lr=1e-4, β1=0.9, β2=0.999, and eps=1e-8. The mini-batch size was set to 256 for GRU-based models. With BERT as the text encoder, lr and the mini-batch size were set to 5e-5 and 64, respectively. The lr was controlled using a linear schedule with a warmup rate of 0.1. Gradient clipping with a maximum gradient norm of 1.0 was performed to prevent gradient overflow. (A hedged PyTorch sketch of this optimization setup appears below the table.) |
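
The recursive hierarchy decoding named in the Pseudocode row (Algorithm 1) can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the linked repository for that): `children` and `score_children` are hypothetical placeholders for the taxonomy structure and HiDEC's per-child scoring; only the 0.5 threshold comes from the reported setup.

```python
# Minimal sketch of recursive hierarchy decoding in inference, under assumptions:
# `children` maps a label id to its child label ids in the taxonomy, and
# `score_children(doc, node, ids)` returns one probability in [0, 1] per child.
# Both names are illustrative and not taken from the paper or its code.

THRESHOLD = 0.5  # decoding threshold reported in the experiment setup

def recursive_hierarchy_decode(doc, children, score_children, root=0):
    """Expand the label hierarchy level by level, keeping every child whose
    probability exceeds the threshold, until no kept node expands further."""
    predicted = set()
    frontier = [root]                      # start from the (virtual) root label
    while frontier:
        next_frontier = []
        for node in frontier:
            child_ids = children.get(node, [])
            if not child_ids:
                continue                   # leaf node: nothing to expand
            probs = score_children(doc, node, child_ids)
            for cid, p in zip(child_ids, probs):
                if p > THRESHOLD:          # keep this label and expand it next
                    predicted.add(cid)
                    next_frontier.append(cid)
        frontier = next_frontier
    return predicted
```

Decoding stops once no kept label has a child above the threshold, which reflects the paper's framing of emitting only the relevant sub-hierarchy rather than scoring the entire label set.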
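
The optimization details in the Experiment Setup row map onto standard PyTorch components. Below is a hedged sketch, assuming generic `model`, `batch`, and `compute_loss` placeholders; the Adam hyperparameters, the linear schedule with a 0.1 warmup rate, and the gradient clipping norm of 1.0 come from the table above, while everything else is illustrative.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, total_steps, lr=1e-4, warmup_rate=0.1):
    # Adam with the reported betas/eps; lr is 1e-4 for the GRU-based encoder
    # and 5e-5 when BERT is used as the text encoder.
    optimizer = Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    warmup_steps = int(total_steps * warmup_rate)

    def linear_schedule(step):
        # Linear warmup to the base lr, then linear decay toward zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda=linear_schedule)
    return optimizer, scheduler

def train_step(model, batch, optimizer, scheduler, compute_loss):
    optimizer.zero_grad()
    loss = compute_loss(model, batch)      # placeholder loss function
    loss.backward()
    # Gradient clipping with a maximum norm of 1.0, as in the reported setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```

In practice, `total_steps` would be the number of mini-batches times the number of epochs, with the mini-batch size set to 256 for the GRU-based encoder or 64 for BERT, as reported in the table.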