Towards a Deep and Unified Understanding of Deep Neural Models in NLP

Authors: Chaoyu Guan, Xiting Wang, Quanshi Zhang, Runjin Chen, Di He, Xing Xie

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show how our method can be applied to four widely used models in NLP and explain their performances on three real-world benchmark datasets. Third, we demonstrate how the information-based measure enriches the capability of explaining DNNs by conducting experiments on one synthetic and three real-world benchmark datasets (RQ3). We explain four widely used models in NLP, including BERT (Devlin et al., 2018), Transformer (Vaswani et al., 2017), LSTM (Hochreiter & Schmidhuber, 1997), and CNN (Kim, 2014).
Researcher Affiliation | Collaboration | 1 John Hopcroft Center and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; 2 Microsoft Research Asia, Beijing, China; 3 Peking University, Beijing, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://aka.ms/nlp/explainability
Open Datasets | Yes | SST-2 (Socher et al., 2013) is the sentiment analysis benchmark we introduce in Sec. 4.2. CoLA (Warstadt et al., 2018) stands for the Corpus of Linguistic Acceptability. QQP (Iyer et al., 2018) is the Quora Question Pairs dataset. All three are standard GLUE tasks (a loading sketch appears after the table).
Dataset Splits | No | The paper uses SST-2, CoLA, and QQP and reports metrics such as accuracy and MCC, implying standard splits, but it does not explicitly describe the training, validation, or test splits (e.g., percentages, sample counts, or a citation for the split methodology).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper does not provide the ancillary software details, such as library names and version numbers, needed to replicate the experiments.
Experiment Setup | Yes | We train a two-layer LSTM model (with attention)... We train an LSTM model that contains four 768D (per direction) bidirectional LSTM layers, a max-pooling layer, and a fully connected layer. The input word embeddings are 768D randomly initialized vectors... Here, we use the encoder from the Transformer... The encoder consists of 3 multi-head self-attention layers (head number is 4, hidden state size is 256, and feed-forward output size is 1024), a first-pooling layer, and a fully connected layer. The input word embeddings are randomly initialized 256D vectors. (Minimal architecture sketches for the BiLSTM and Transformer setups appear after the table.)
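
On the open-datasets row: all three benchmarks are distributed as GLUE tasks with predefined splits. The snippet below is a minimal loading sketch, assuming the HuggingFace datasets library; the paper does not say how the data was obtained, so treat this as one convenient route rather than the authors' pipeline.

    # Minimal loading sketch (assumption: HuggingFace `datasets` library).
    from datasets import load_dataset

    sst2 = load_dataset("glue", "sst2")  # SST-2: sentiment analysis
    cola = load_dataset("glue", "cola")  # CoLA: linguistic acceptability
    qqp = load_dataset("glue", "qqp")    # QQP: question-pair matching

    # Each dataset exposes its predefined splits with their sizes.
    for name, ds in [("SST-2", sst2), ("CoLA", cola), ("QQP", qqp)]:
        print(name, {split: len(ds[split]) for split in ds})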
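
The quoted BiLSTM setup is concrete enough to sketch. Below is a minimal PyTorch rendering of it; the class name, vocabulary size, and number of classes are illustrative assumptions, not details from the paper.

    # Minimal PyTorch sketch of the described 4-layer BiLSTM classifier.
    # `vocab_size` and `num_classes` are illustrative assumptions.
    import torch
    import torch.nn as nn

    class BiLSTMClassifier(nn.Module):
        def __init__(self, vocab_size=30000, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, 768)  # random 768D embeddings
            # Four bidirectional layers, 768 units per direction.
            self.lstm = nn.LSTM(768, 768, num_layers=4,
                                bidirectional=True, batch_first=True)
            self.fc = nn.Linear(2 * 768, num_classes)

        def forward(self, token_ids):                 # (batch, seq_len)
            h, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 1536)
            return self.fc(h.max(dim=1).values)       # max-pooling over time

    model = BiLSTMClassifier()
    print(model(torch.randint(0, 30000, (8, 32))).shape)  # torch.Size([8, 2])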
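
The Transformer-encoder setup can be sketched the same way, wiring together the stated hyperparameters (3 self-attention layers, 4 heads, 256D hidden states, 1024D feed-forward, first-pooling, fully connected output); again, vocabulary size and class count are assumptions.

    # Minimal PyTorch sketch of the described Transformer-encoder classifier.
    # `vocab_size` and `num_classes` are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TransformerClassifier(nn.Module):
        def __init__(self, vocab_size=30000, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, 256)  # random 256D embeddings
            layer = nn.TransformerEncoderLayer(
                d_model=256, nhead=4, dim_feedforward=1024, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=3)
            self.fc = nn.Linear(256, num_classes)

        def forward(self, token_ids):                  # (batch, seq_len)
            h = self.encoder(self.embed(token_ids))
            return self.fc(h[:, 0, :])                 # "first-pooling": first token

    model = TransformerClassifier()
    print(model(torch.randint(0, 30000, (8, 32))).shape)  # torch.Size([8, 2])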