Contextualized Non-Local Neural Networks for Sequence Learning

Authors: Pengfei Liu, Shuaichen Chang, Xuanjing Huang, Jian Tang, Jackie Chi Kit Cheung

AAAI 2019, pp. 6762-6769 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on ten NLP tasks in text classification, semantic matching, and sequence labelling show that our proposed model outperforms competitive baselines and discovers task-specific dependency structures, thus providing better interpretability to users.
Researcher Affiliation | Academia | School of Computer Science, Fudan University; Shanghai Institute of Intelligent Electronics & Systems; MILA & McGill University; The Ohio State University. {pfliu14,xjhuang}@fudan.edu.cn, chang.1692@osu.edu, jian.tang@hec.ca, jcheung@cs.mcgill.ca
Pseudocode | Yes | Algorithm 1: Learning Processes of Contextualized Non-local Neural Networks for Sequences (a generic sketch of the underlying non-local operation appears after the table).
Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We choose two typical datasets SICK (Marelli et al. 2014) and SNLI (Bowman et al. 2015) for this task. Sequence Labelling: We choose POS, Chunking and NER as evaluation tasks on Penn Treebank, CoNLL 2000 and CoNLL 2003 respectively.
Dataset Splits | No | The paper states, 'For each task, we take the hyperparameters which achieve the best performance on the development set via grid search.' This implies the existence of a development set, but the paper does not specify split percentages or example counts for any of the datasets (QC, SST2, MR, IMDB, SICK, SNLI, POS, Chunking, NER).
Hardware Specification | No | The paper does not mention any specific hardware (GPU model, CPU model, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions 'stochastic gradient descent with the diagonal variant of AdaDelta (Zeiler 2012)', 'GloVe vectors (Pennington, Socher, and Manning 2014)', and the 'Stanford NLP toolkit (Manning et al. 2014)'. These are software dependencies, but no version numbers are given for any of them.
Experiment Setup | Yes | To minimize the objective, we use stochastic gradient descent with the diagonal variant of AdaDelta (Zeiler 2012). The word embeddings for all of the models are initialized with GloVe vectors (Pennington, Socher, and Manning 2014). The other parameters are initialized by randomly sampling from a uniform distribution in [-0.1, 0.1]. For each task, we take the hyperparameters which achieve the best performance on the development set via grid search.
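The quoted setup maps naturally to a few lines of framework code. Below is a minimal sketch of that setup, assuming PyTorch; the model body, vocabulary size, and the glove_matrix loader are placeholders, since the paper releases no code and names no framework.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper selects hyperparameters per task via grid search.
vocab_size, embed_dim, hidden_dim, num_classes = 20000, 300, 128, 2

embedding = nn.Embedding(vocab_size, embed_dim)
model = nn.Sequential(  # placeholder for the paper's actual model body
    nn.Linear(embed_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),
)

# Word embeddings initialized from pre-trained GloVe vectors; glove_matrix
# stands in for a (vocab_size, embed_dim) tensor loaded from the released
# GloVe files.
glove_matrix = torch.randn(vocab_size, embed_dim)
embedding.weight.data.copy_(glove_matrix)

# All other parameters: uniform initialization in [-0.1, 0.1], as stated.
for p in model.parameters():
    nn.init.uniform_(p, -0.1, 0.1)

# "Stochastic gradient descent with the diagonal variant of AdaDelta
# (Zeiler 2012)" maps to torch.optim.Adadelta here.
optimizer = torch.optim.Adadelta(
    list(embedding.parameters()) + list(model.parameters())
)
```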
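The Algorithm 1 pseudocode itself is not reproduced in this report. As a rough orientation, here is a minimal PyTorch sketch of a generic dot-product non-local operation of the kind the paper's model builds on; it is not the contextualized variant from Algorithm 1, and the class name, projection layers, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Generic dot-product non-local operation over a sequence.

    A sketch of the standard non-local building block, not the paper's
    contextualized variant.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Affinity between every pair of positions: (batch, seq_len, seq_len)
        attn = F.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        # Each position aggregates features from all positions (the
        # "non-local" step), with a residual connection to the input.
        return x + attn @ v


# Usage: a batch of 2 sequences, 5 tokens each, 16-dim features.
block = NonLocalBlock(16)
out = block(torch.randn(2, 5, 16))  # -> (2, 5, 16)
```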