PEACH: Pretrained-Embedding Explanation across Contextual and Hierarchical Structure

Authors: Feiqi Cao, Soyeon Caren Han, Hyunsuk Chung

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine different NLP text classification benchmarks.
Researcher Affiliation | Collaboration | Feiqi Cao (1), Soyeon Caren Han (1,2) and Hyunsuk Chung (3); (1) School of Computer Science, University of Sydney; (2) School of Computing and Information Systems, University of Melbourne; (3) Fortify Edge
Pseudocode | No | No pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Code and Appendix are in https://github.com/adlnlp/peach.
Open Datasets | Yes | We evaluate PEACH with 5 state-of-the-art PLMs on 9 benchmark datasets. Those datasets encompass five text classification tasks... Microsoft Research Paraphrase (MSRP) [Dolan et al., 2004]... Sentences Involving Compositional Knowledge (SICK) [Marelli et al., 2014]... Stanford Sentiment Treebank (SST2) [Socher et al., 2013]... The MR [Pang and Lee, 2005]... IMDB [Maas et al., 2011]...
Dataset Splits | Yes | The training set contains 4076 sentence pairs and 1725 testing pairs for generating decision trees. During PLM finetuning, we randomly split the training set with a 9:1 ratio, so 3668 pairs are used for training and 408 pairs are used for validation. The dataset has 4439, 495, and 4906 pairs for training, validation and testing sets. SST2 [Socher et al., 2013] has 6920, 872 and 1821 documents for training, validation and testing. For MR and IMDB, since no official validation split is provided, the training sets are randomly split into a 9:1 ratio to obtain a validation set for finetuning PLMs.
Hardware Specification | Yes | All experiments are conducted on Google Colab with Intel(R) Xeon(R) CPU and NVIDIA T4 Tensor Core GPU.
Software Dependencies | No | Decision trees are trained using the chefboost library with a maximum depth of 95. We used the en_core_web_sm model provided by the spaCy library to obtain the NER and POS tags for visualization filters.
Experiment Setup | Yes | A batch size of 32 is applied for all models and datasets. The learning rate is set to 5e-5 for all models, except for the ALBERT model on the SST2, 20ng and IMDB datasets, where it is set to 1e-5. All models are fine-tuned for 4 epochs, except for 20ng, which uses 30 epochs. To extract features from the learned embeddings and reduce the number of input features for the decision tree, we experiment with quantile thresholds of 0.9 and 0.95 for the correlation methods. For k-means clustering, we search the number of clusters from 10 to 100 (step size: 10 or 20), except for IMDB, where we search from 130 to 220 (step size: 30). For CNN features, we use kernel size 2, stride size 2, and padding size 0 for the two convolution layers. The same hyperparameters are applied to the first pooling layer, except for IMDB, where a stride size of 1 is used to ensure there are enough input features for the next convolutional block and a sufficiently large number of features from the last pooling layer. The kernel size and stride size of the last pooling layer are adjusted to match the number of clusters used in k-means clustering. Decision trees are trained using the chefboost library with a maximum depth of 95.
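The sketches below illustrate a few of the setup details quoted in the rows above; all function, variable, and parameter names are illustrative assumptions, not taken from the authors' code. First, the Dataset Splits row mentions a random 9:1 train/validation split for PLM fine-tuning on datasets without an official validation set (MR, IMDB). A minimal sketch of such a split, assuming scikit-learn:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for a dataset without an official validation split (e.g. MR, IMDB).
texts = [f"document {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 9:1 split; the fixed seed and stratification are illustrative choices.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)
print(len(train_texts), len(val_texts))  # 90 10
```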
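The Software Dependencies row notes that NER and POS tags for the visualization filters come from spaCy's en_core_web_sm model. A minimal usage sketch (the example sentence is made up); it assumes the model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

# Load the small English pipeline used for the visualization filters.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The movie was shot in Sydney and released in 2013.")

# Token-level POS tags and document-level named entities.
pos_tags = [(token.text, token.pos_) for token in doc]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(pos_tags)
print(entities)
```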
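The Experiment Setup row describes reducing the embedding features fed to the decision tree, either via correlation methods with quantile thresholds of 0.9/0.95 or via k-means clustering with a search over the number of clusters, with the tree itself trained by chefboost. The sketch below shows one plausible wiring of the k-means variant followed by a chefboost fit; the per-cluster averaging, the toy embeddings and labels, and the column names are assumptions, and whether chefboost exposes a maximum-depth option (the paper uses 95) through its config dict depends on the library version.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from chefboost import Chefboost as chef

rng = np.random.default_rng(0)
# Toy stand-in for document embeddings from a fine-tuned PLM: 200 documents x 768 dims.
embeddings = rng.normal(size=(200, 768))

def reduce_by_kmeans(embeddings, n_clusters):
    """Group embedding dimensions with k-means and average each group into one feature.

    Illustrative reading of the k-means feature reduction, not the authors' code.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    dim_labels = km.fit_predict(embeddings.T)  # cluster the 768 dimensions
    return np.stack(
        [embeddings[:, dim_labels == c].mean(axis=1) for c in range(n_clusters)],
        axis=1,
    )  # shape: (documents, n_clusters)

# Search the number of clusters from 10 to 100 with a step of 10, as in the setup row.
for k in range(10, 101, 10):
    features = reduce_by_kmeans(embeddings, k)
    # ... train and evaluate a decision tree on `features`, keep the best k ...

# Fit a C4.5 tree with chefboost on one reduced feature set (toy binary labels).
features = reduce_by_kmeans(embeddings, 10)
df = pd.DataFrame(features, columns=[f"f{i}" for i in range(features.shape[1])])
df["Decision"] = rng.integers(0, 2, size=len(df)).astype(str)  # default target column name
model = chef.fit(df, config={"algorithm": "C4.5"})
```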
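The same row reports the CNN feature alternative: kernel size 2, stride 2, padding 0 for the two convolution layers, with the same hyperparameters for the first pooling layer and the last pooling layer adjusted to match the number of clusters. A PyTorch sketch under those settings; the channel widths, input length, and last pooling configuration are assumptions:

```python
import torch
import torch.nn as nn

# Illustrative CNN feature extractor mirroring the reported conv hyperparameters
# (kernel 2, stride 2, padding 0); channel sizes and the last pooling layer are placeholders.
extractor = nn.Sequential(
    nn.Conv1d(in_channels=768, out_channels=256, kernel_size=2, stride=2, padding=0),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),  # first pooling layer, same hyperparameters
    nn.Conv1d(in_channels=256, out_channels=64, kernel_size=2, stride=2, padding=0),
    nn.ReLU(),
    # The paper adjusts this last pooling layer so the output length matches the
    # number of k-means clusters; kernel/stride 2 here is only a placeholder.
    nn.MaxPool1d(kernel_size=2, stride=2),
)

# Toy input: batch of 4 documents, 768-dim token embeddings over 128 tokens.
tokens = torch.randn(4, 768, 128)
features = extractor(tokens)
print(features.shape)  # torch.Size([4, 64, 8])
```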