Continuous Self-Attention Models with Neural ODE Networks
Authors: Jing Zhang, Peng Zhang, Baiwen Kong, Junqiu Wei, Xin Jiang (pp. 14393-14401)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a series of experiments on text classification, natural language inference (NLI) and text matching tasks. |
| Researcher Affiliation | Collaboration | Jing Zhang¹, Peng Zhang¹*, Baiwen Kong¹, Junqiu Wei², Xin Jiang²; ¹College of Intelligence and Computing, Tianjin University, Tianjin, China; ²Huawei Noah's Ark Lab, China |
| Pseudocode | No | The paper describes the model architecture and components but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the described methodology. |
| Open Datasets | Yes | MR (Pang and Lee 2004a): Movie reviews are divided into positive and negative categories; CR (Hu and Liu 2004): Customer reviews set where the task is to predict positive or negative product reviews; SUBJ (Pang and Lee 2004b): Subjectivity dataset where the target is to classify a text as being subjective or objective; MPQA (Wiebe, Wilson, and Cardie 2005): Opinion polarity detection subtask; TREC (Li and Roth 2002): question classification dataset which involves classifying a question into 6 question types. ... SNLI (Bowman et al. 2015): Stanford Natural Language Inference is a benchmark dataset for natural language inference. ... WikiQA (Yang, Yih, and Meek 2015) is a retrieval-based question answering dataset based on Wikipedia |
| Dataset Splits | No | The paper evaluates on test sets but does not specify the training, validation, and test splits (e.g., percentages or sample counts) for reproducibility, nor does it explicitly reference predefined splits with clear citations. |
| Hardware Specification | Yes | For all tasks, we implement our model with Pytorch-1.20, and train them on a Nvidia P40 GPU. |
| Software Dependencies | Yes | For all tasks, we implement our model with Pytorch-1.20 |
| Experiment Setup | Yes | Word embeddings are initialized by GloVe (Pennington, Socher, and Manning 2014) with 300 dimensions. All other parameters are initialized with Xavier (Glorot and Bengio 2010) and normalized by weight normalization (Salimans and Kingma 2016). As for the learning method, we use the Adam optimizer (Kingma and Ba 2014) and an exponentially decaying learning rate with a linear warm-up. The dimension of the hidden vectors is set to 300, which is equal to the word embedding size. As for convolution, the filter size is set to 2. In addition, dropout with a keep probability of 0.1 is applied in the layers. The initial learning rate is set from 0.0001 to 0.003 and the batch size is tuned from 80 to 256. The L2 regularization decay factor is 10⁻⁵. In addition, the initial step size of the self-attention ODE solver is tuned from 10⁻² to 5×10⁻¹. (A minimal configuration sketch follows the table.) |
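
The setup reported above (GloVe-sized hidden vectors, Xavier initialization with weight normalization, Adam with L2 decay, and an exponentially decaying learning rate with linear warm-up) can be expressed as a short PyTorch configuration. The sketch below is an illustration of that description, not the authors' released code: the stand-in model, warm-up length, and per-step decay rate are assumptions, since the paper does not report them.

```python
# Hedged sketch of the reported training configuration. The Sequential model is a
# placeholder for the paper's continuous self-attention network; WARMUP_STEPS and
# DECAY_RATE are assumed values, not reported in the paper.
import torch
import torch.nn as nn

EMBED_DIM = 300      # GloVe 300-d embeddings; hidden size equals embedding size
LR = 1e-3            # initial learning rate, tuned in [1e-4, 3e-3]
L2_DECAY = 1e-5      # L2 regularization decay factor
WARMUP_STEPS = 4000  # linear warm-up length (assumed)
DECAY_RATE = 0.999   # per-step exponential decay (assumed)

# Stand-in for the paper's model, used only to make the sketch runnable.
model = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.ReLU(),
                      nn.Linear(EMBED_DIM, 2))

# Xavier initialization followed by weight normalization, as reported.
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.utils.weight_norm(module)

# Adam with L2 weight decay; exponentially decaying LR with a linear warm-up.
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=L2_DECAY)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: (step + 1) / WARMUP_STEPS if step < WARMUP_STEPS
    else DECAY_RATE ** (step - WARMUP_STEPS))
```

In a training loop, `scheduler.step()` would be called once per optimizer step so the learning rate first ramps up linearly and then decays exponentially, matching the schedule described in the paper.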