Multi-Scale Self-Attention for Text Classification
Authors: Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang
AAAI 2020, pp. 7847-7854 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results of three different kinds of tasks (21 datasets) show our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate size datasets. |
| Researcher Affiliation | Collaboration | Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University; School of Computer Science, Fudan University; AWS Shanghai AI Lab; New York University Shanghai; {qpguo16, xpqiu, pfliu14, xyxue}@fudan.edu.cn, zz@nyu.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We implement the MS-Trans with Pytorch and DGL (Wang et al. 2019)' with a footnote linking to 'https://pytorch.org'. This link points to a third-party library, not to the authors' own source code for their methodology. |
| Open Datasets | Yes | We evaluate our model on 17 text classification datasets, 3 sequence labeling datasets and 1 natural language inference dataset. All the statistics can be found in Tab-1. (Table 1 lists datasets such as SST (Socher et al. 2013), MTL-16 (Liu, Qiu, and Huang 2017), PTB POS (Marcus, Santorini, and Marcinkiewicz 1993), CoNLL03 (Sang and Meulder 2003), CoNLL2012 NER (Pradhan et al. 2012), and SNLI (Bowman et al. 2015).) |
| Dataset Splits | Yes | Table 1: An overall of datasets and its hyper-parameters... The table reports Train / Dev. / Test / \|V\| sizes per dataset (e.g., SST: 8k train, 1k dev, 2k test; MTL-16: 1400 / 200 / 400) alongside H DIM, α, and head DIM. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper mentions 'Pytorch' and 'DGL' but does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | Table 1: An overall of datasets and its hyper-parameters, where H DIM, α, and head DIM indicate the dimension of hidden states, the hyper-parameter controlling the scale distribution, and the dimension of each head, respectively. The optimizer is Adam (Kingma and Ba 2014) and the learning rate and dropout ratio are listed in the Appendix. A minimal configuration sketch based on this description follows the table. |
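
The sketch below illustrates, under stated assumptions, the kind of setup the paper describes: a per-dataset hidden size (H DIM), a per-head size (head DIM), an Adam optimizer, and a dropout ratio. The numeric values, the two-layer depth, and the use of a stock `nn.TransformerEncoderLayer` are placeholders of ours, not the authors' Multi-Scale Transformer or their Appendix hyper-parameters.

```python
import torch
import torch.nn as nn

# Hypothetical hyper-parameters mirroring the Table 1 columns (H DIM, head DIM)
# and the Adam / learning-rate / dropout setup the paper defers to its Appendix.
# The concrete values below are placeholders, NOT the authors' numbers.
H_DIM = 300        # hidden-state dimension ("H DIM")
HEAD_DIM = 30      # per-head dimension ("head DIM")
N_HEADS = H_DIM // HEAD_DIM
DROPOUT = 0.1      # placeholder; actual ratios are listed in the paper's Appendix
LR = 1e-3          # placeholder; actual learning rates are listed in the paper's Appendix

# A plain nn.TransformerEncoderLayer stands in for the (unreleased) Multi-Scale
# Transformer layer; the multi-scale attention itself is not reproduced here.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=H_DIM,
    nhead=N_HEADS,
    dim_feedforward=4 * H_DIM,
    dropout=DROPOUT,
)
model = nn.TransformerEncoder(encoder_layer, num_layers=2)

# The paper states the optimizer is Adam (Kingma and Ba 2014).
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Dummy forward/backward pass to confirm the wiring: (seq_len, batch, H_DIM).
x = torch.randn(35, 16, H_DIM)
loss = model(x).mean()
loss.backward()
optimizer.step()
```

Reproducing the paper exactly would require substituting the multi-scale attention layer (with its α-controlled scale distribution) for the standard encoder layer, along with the per-dataset learning rates and dropout ratios from the Appendix.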