ACT: an Attentive Convolutional Transformer for Efficient Text Classification
Authors: Pengfei Li, Peixiang Zhong, Kezhi Mao, Dongzhe Wang, Xuefeng Yang, Yunfeng Liu, Jianxiong Yin, Simon See (pp. 13261-13269)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various text classification tasks and detailed analyses show that ACT is a lightweight, fast, and effective universal text classifier, outperforming CNNs, RNNs, and attentive models including Transformer. |
| Researcher Affiliation | Collaboration | Pengfei Li (1), Peixiang Zhong (1), Kezhi Mao (1)*, Dongzhe Wang (2), Xuefeng Yang (2), Yunfeng Liu (2), Jianxiong Yin (3), Simon See (3); (1) Nanyang Technological University, Singapore; (2) Zhui Yi Technology, Shenzhen, China; (3) NVIDIA AI Tech Center |
| Pseudocode | No | The paper includes figures illustrating the architecture and mathematical equations but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the ACT model. |
| Open Datasets | Yes | We use six widely-studied datasets to evaluate our model, two for each text classification task. These datasets are diverse in the aspects of type, size, number of classes, and document length. Table 1 shows the statistics of the datasets. For sentiment analysis, we use two datasets constructed by Zhang et al. (2015)... For topic categorization, we use AG's News (AGNews) and DBPedia datasets created by Zhang et al. (2015)... For relation extraction, we use TACRED and SemEval-2010 Task 8 (SemEval) datasets... |
| Dataset Splits | Yes | For sentiment analysis and topic categorization, we set aside 10% of training data as the development set to tune model hyperparameters. |
| Hardware Specification | Yes | We report the average time needed to compute a single batch (batch size of 100) of Yelp F. dataset using NVIDIA Tesla P40 GPU with Intel Xeon E5-2667 CPU. |
| Software Dependencies | No | The paper mentions several techniques and components like 'GloVe word embeddings', 'Dropout regularization', 'GELUs', and 'center loss', but it does not specify any software names with version numbers (e.g., PyTorch version, Python version, etc.) that would allow for reproducible setup. |
| Experiment Setup | Yes | In our experiments, word embedding matrix W^wrd is initialized with 300-d GloVe word embeddings (Pennington, Socher, and Manning 2014). The fully connected layer before softmax has a dimension of 100. Dropout regularization (Srivastava et al. 2014) with a rate of 0.4 is applied during training. The weight and learning rate for center loss are 0.001 and 0.1 respectively. The models are trained using SGD with initial learning rate of 0.01 and momentum of 0.9. Learning rate is decayed with a rate of 0.9 after 10 epochs if the score on the development set does not improve. Batch size is set to 100 and the model is trained for 70 epochs. The dimensions of global attention and position embedding are 200 and 60 respectively. We use GELUs (Hendrycks and Gimpel 2016) for all the nonlinear activation functions. |
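The experiment-setup row above can be collected into a reusable configuration. The sketch below is our own illustrative reconstruction, not code from the paper: the dictionary keys and the helper `decay_lr_on_plateau` are hypothetical names, but the values match the hyperparameters the paper reports (SGD, lr 0.01, momentum 0.9, decay factor 0.9 after epoch 10 on a stalled dev score, batch size 100, 70 epochs).

```python
# Hypothetical summary of the ACT training hyperparameters reported in the
# paper. All identifiers here are our own naming, not the authors'.
ACT_HPARAMS = {
    "word_embedding": "GloVe-300d",   # initialization of W^wrd
    "fc_dim": 100,                    # fully connected layer before softmax
    "dropout": 0.4,
    "center_loss_weight": 0.001,
    "center_loss_lr": 0.1,
    "optimizer": "SGD",
    "lr": 0.01,
    "momentum": 0.9,
    "lr_decay": 0.9,                  # applied after epoch 10 on plateau
    "batch_size": 100,
    "epochs": 70,
    "global_attention_dim": 200,
    "position_embedding_dim": 60,
    "activation": "GELU",
}

def decay_lr_on_plateau(lr, epoch, improved, decay=0.9, warmup_epochs=10):
    """Decay the learning rate by `decay` after `warmup_epochs` epochs
    whenever the development-set score did not improve; otherwise keep it."""
    if epoch > warmup_epochs and not improved:
        return lr * decay
    return lr
```

For example, at epoch 11 with no dev-set improvement the rate drops from 0.01 to 0.009, while during the first 10 epochs it stays fixed regardless of the score.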