Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Towards Interpretable Natural Language Understanding with Explanations as Latent Variables
Authors: Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang, Xiaodan Liang, Maosong Sun, Chenyan Xiong, Jian Tang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two natural language understanding tasks demonstrate that our framework can not only make effective predictions in both supervised and semi-supervised settings, but also generate good natural language explanations. |
| Researcher Affiliation | Collaboration | 1 Beihang University, 2 Tsinghua University, 3 South China University of Technology, 4 Sun Yat-sen University, 5 Microsoft Research, 6 Mila-Québec AI Institute, 7 HEC Montréal |
| Pseudocode | Yes | Algorithm 1: Explanation-based Self-Training (ELV-EST) |
| Open Source Code | Yes | Code is available at https://github.com/JamesHujy/ELV.git |
| Open Datasets | Yes | We conduct experiments on two tasks: relation extraction (RE) and aspect-based sentiment classification (ASC). For relation extraction we choose two datasets, TACRED [23] and SemEval [21] in our experiments. We use two customer review datasets, Restaurant and Laptop, which are part of SemEval 2014 Task 4 [24] for the aspect-based sentiment classification task. |
| Dataset Splits | Yes | Table 1: Statistics of datasets. We present the size of train/dev/test sets for 4 datasets in both supervised and semi-supervised settings. Moreover, # Exp means the size of initial explanation sets. ... SemEval [21] 203 7,016 1,210 800 2,715 |
| Hardware Specification | No | The paper mentions using 'BERT-base and UniLM-base as the backbone of our prediction model and explanation generation model, respectively.' but does not specify any hardware details like GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions using 'BERT-base' and 'UniLM-base' as backbone models, 'Sentence BERT [19]' for embeddings, and 'Adam optimizers'. However, it does not provide specific version numbers for these or other software libraries/frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We select batch size over {32, 64} and learning rate over {1e-5, 2e-5, 3e-5}. The number of retrieved explanations is set to 10 for all tasks. We train the prediction model for 3 epochs and the generation model for 5 epochs in each EM iteration. We use Adam optimizers and early stopping with the best validation F1-score. |
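
The search space quoted in the Experiment Setup row can be written out as a small configuration grid. The following is a minimal sketch, not the authors' code: the names `RunConfig`, `enumerate_configs`, and `select_best` are hypothetical, and the `validation_f1` callable stands in for a full training run that reports the best dev-set F1-score. Only the numeric values come from the quoted text.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass(frozen=True)
class RunConfig:
    """One candidate setting (hypothetical container, not from the paper's code)."""
    batch_size: int
    learning_rate: float
    # Values below are fixed in the quoted setup rather than searched.
    num_retrieved_explanations: int = 10
    predictor_epochs_per_em_iteration: int = 3
    generator_epochs_per_em_iteration: int = 5
    optimizer: str = "adam"


# Search ranges as quoted: batch size over {32, 64}, learning rate over {1e-5, 2e-5, 3e-5}.
BATCH_SIZES = (32, 64)
LEARNING_RATES = (1e-5, 2e-5, 3e-5)


def enumerate_configs() -> List[RunConfig]:
    """Return every (batch size, learning rate) combination in the grid."""
    return [
        RunConfig(batch_size=b, learning_rate=lr)
        for b, lr in product(BATCH_SIZES, LEARNING_RATES)
    ]


def select_best(configs: List[RunConfig],
                validation_f1: Callable[[RunConfig], float]) -> RunConfig:
    """Pick the config with the highest validation F1.

    `validation_f1` is an assumed placeholder for training a model with the
    given config and returning its best dev-set F1-score.
    """
    return max(configs, key=validation_f1)


if __name__ == "__main__":
    for cfg in enumerate_configs():
        print(cfg)
```

The grid has six combinations (two batch sizes times three learning rates). How the paper combined this search with early stopping across EM iterations is not stated in the quoted text, so that part is left to the caller-supplied scoring function.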