Improved Graph Contrastive Learning for Short Text Classification

Authors: Yonghao Liu, Lan Huang, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that GIFT significantly outperforms previous state-of-the-art methods. Our code can be found in https://github.com/KEAML-JLU/GIFT.
Researcher Affiliation | Academia | Yonghao Liu^1, Lan Huang^1, Fausto Giunchiglia^2, Xiaoyue Feng^1*, Renchu Guan^1. ^1 Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University; ^2 University of Trento. yonghao20@mails.jlu.edu.cn, {huanglan, fengxy, guanrenchu}@jlu.edu.cn, fausto.giunchiglia@unitn.it
Pseudocode | Yes | Algorithm 1: The Training of GIFT
Input: The corpus D = {d_i}_{i=1}^N
Output: The well-trained model
1: while not done do
2:   for π ∈ {w, e, p} do
3:     Build the component graph G_π = {V_π, X_π, A_π}
4:   end for
5:   Update node embeddings for each component graph using Eq. 1.
6:   Construct TD matrices relating texts and nodes.
7:   Obtain text representations using Eq. 2.
8:   Perform SVD on the TD matrices using Eq. 4.
9:   Obtain the augmented views of the texts using Eq. 5.
10:  Conduct CL using Eq. 6.
11:  Perform constrained seed k-means on the corpus.
12:  Assign weak labels to the unlabeled texts.
13:  Conduct cluster-oriented CL using Eq. 7.
14:  Compute the cross-entropy loss using Eq. 8.
15:  Optimize the model with the joint loss of Eq. 9.
16: end while
17: return The well-trained GIFT.
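Steps 8-9 (the SVD-based view augmentation) are the least conventional part of the loop, so here is a minimal runnable sketch of that step alone, assuming a dense PyTorch text-node (TD) matrix and the rank-15 truncation reported in the experiment setup row below; the function name and toy sizes are ours, not the authors'.

```python
# Illustrative sketch of Algorithm 1, steps 8-9: truncate the SVD of a
# text-node (TD) matrix to obtain a low-rank reconstruction that serves
# as an augmented view of the texts (Eqs. 4-5). Not the authors' code;
# see https://github.com/KEAML-JLU/GIFT for the official implementation.
import torch

def svd_augment(td: torch.Tensor, rank: int = 15) -> torch.Tensor:
    """Return a rank-`rank` approximation of the TD matrix."""
    u, s, v = torch.svd_lowrank(td, q=rank)  # approximate truncated SVD
    return u @ torch.diag(s) @ v.T

td = torch.rand(200, 500)          # toy TD matrix: 200 texts x 500 graph nodes
td_aug = svd_augment(td, rank=15)  # augmented view, same shape as td
print(td_aug.shape)                # torch.Size([200, 500])
```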
Open Source Code | Yes | Our code can be found in https://github.com/KEAML-JLU/GIFT.
Open Datasets | Yes | To verify the effectiveness of our proposed model, we conduct experiments on several benchmark datasets that are widely used in STC tasks. The statistics of these datasets are summarized in Table 1 and described in detail below.
(1) Twitter is a binary classification dataset comprising numerous tweets expressing two sentiments, collected via NLTK.
(2) MR (Pang and Lee 2005) is a binary classification dataset of movie reviews, where each review is a single sentence labeled as positive or negative.
(3) Snippets (Phan, Nguyen, and Horiguchi 2008) consists of web search snippets returned by the Google search engine.
(4) Stack Overflow (Xu et al. 2017) contains question titles from twenty categories, crawled from the Stack Overflow website.
Dataset Splits | Yes | Following previous studies (Wang et al. 2021), we randomly select 40 labeled samples for each category of the dataset, half of which are used for training and the other half for validation; the remaining data are used for testing, to simulate the realistic setting with few labeled samples.
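A minimal sketch of this split protocol, assuming integer class labels in a NumPy array; the helper name and the fixed seed are ours, introduced only for illustration.

```python
# Hypothetical reading of the split protocol: sample 40 labeled examples
# per class, use 20 for training and 20 for validation, and treat all
# remaining examples as the test set.
import numpy as np

def few_label_split(labels: np.ndarray, per_class: int = 40, seed: int = 0):
    rng = np.random.default_rng(seed)
    train, val = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])[:per_class]
        train += list(idx[: per_class // 2])  # 20 samples for training
        val += list(idx[per_class // 2 :])    # 20 samples for validation
    test = np.setdiff1d(np.arange(len(labels)), np.array(train + val))
    return np.array(train), np.array(val), test
```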
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types) used for running its experiments. It mentions SVD complexity but not the hardware.
Software Dependencies | No | The paper mentions "We use the Adam method to optimize GIFT with the learning rate 0.001." but does not specify version numbers for any software, libraries, or programming languages.
Experiment Setup | Yes | We adopt two-layer GCNs to encode each component graph, where the hidden dimension is set to 128. The temperatures τ in CL and cluster-oriented CL are uniformly set to 0.5. All projection heads are implemented as MLPs with one hidden layer. In our case, the required rank of the approximate matrix is set to 15. The control parameters of the loss function, η and ζ, are both set to 0.5. We use the Adam method to optimize GIFT with a learning rate of 0.001.
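To make these hyperparameters concrete, here is a self-contained sketch wiring them together: a two-layer GCN encoder with hidden dimension 128, an MLP projection head with one hidden layer, an InfoNCE-style contrastive loss with τ = 0.5, and Adam at learning rate 0.001. The module layout, the exact loss form, and the way η and ζ enter the joint objective (Eq. 9) are our assumptions; consult the official repository for the authors' architecture.

```python
# Sketch of the reported hyperparameters; architectural details are our
# assumptions, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNEncoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int = 128):  # hidden dim 128
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, hid_dim)
        # Projection head: an MLP with one hidden layer, per the setup.
        self.proj = nn.Sequential(
            nn.Linear(hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # Two GCN layers: H = ReLU(A_hat X W), with A_hat the normalized adjacency.
        h = F.relu(self.w1(a_hat @ x))
        h = self.w2(a_hat @ h)
        return self.proj(h)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """InfoNCE-style contrastive loss; row i of z1 and z2 is a positive pair."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0), device=z1.device))

model = GCNEncoder(in_dim=300)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Joint objective (Eq. 9); the composition below is our guess from the text:
# loss = loss_ce + 0.5 * loss_cl + 0.5 * loss_cluster  (eta = zeta = 0.5)
```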