Learning by Interpreting
Authors: Xuting Tang, Abdul Rafae Khan, Shusen Wang, Jia Xu
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate a significant increase in accuracy of up to +3.4 BLEU points on NMT and up to +4.8 points on GLUE tasks, verifying our hypothesis that it is possible to achieve better model learning by incorporating model interpretation knowledge. |
| Researcher Affiliation | Collaboration | Xuting Tang¹, Abdul Rafae Khan¹, Shusen Wang², and Jia Xu¹ (¹Stevens Institute of Technology, ²Xiaohongshu Inc.); {xtang18, akhan4, jxu70}@stevens.edu, shusenwang@xiaohongshu.com |
| Pseudocode | Yes | Algorithm 1 Attention Words/Positions Extraction |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of their methodology. It mentions using existing toolkits and libraries like Huggingface, Fairseq, LIME, and SHAP but not their own specific implementation. |
| Open Datasets | Yes | We conduct our experiments on the large-scale WMT 2014 dataset in the German-English news domain and the medium scale IWSLT 2017 dataset in the French-English TEDTalk domain. |
| Dataset Splits | Yes | Both models use early stopping with a patience of 5, and the maximum number of epochs is 50. The training reaches its best validation BLEU score at iteration 10, and its corresponding test BLEU score is +0.4 higher than fine-tuning with one iteration. The system is fine-tuned once more using the best learning rate on the validation set. |
| Hardware Specification | Yes | The training is performed on 4 Nvidia Titan V GPUs. |
| Software Dependencies | No | The paper mentions software like Fairseq toolkit, Huggingface BERT-base, Adam optimizer, LIME, and SHAP, but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The hyper-parameters for training the ConvS2S network include: the encoder and decoder embedding dimensions as 768, the learning rate as 0.25, the gradient clip norm as 0.1, the dropout ratio as 0.2, the max tokens in a batch as 4000, and the optimizer as NAG. For the Transformer, we set the embedding dimension as 768, the learning rate as 5 × 10⁻⁴, the warmup updates as 4000, the dropout as 0.3, the weight decay as 1 × 10⁻⁴, the max tokens in a batch as 5000, and the Adam optimizer [Kingma and Ba, 2014] betas as 0.9 and 0.98. The hyper-parameters for fine-tuning are as follows: the encoder dimension is 768, the number of epochs is 3, the learning rate is 5 × 10⁻⁵, the dropout rate is 0.1, and the Adam optimizer betas are 0.9 and 0.999. (A hedged sketch of the fine-tuning settings follows the table.) |
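Since the paper releases no code, the sketch below only illustrates how the quoted fine-tuning hyper-parameters (BERT-base, 3 epochs, learning rate 5 × 10⁻⁵, dropout 0.1, Adam betas 0.9/0.999) could be expressed with the Huggingface Transformers API it mentions; the checkpoint name, output directory, and label count are assumptions, not values taken from the paper.

```python
# Illustrative sketch only: the paper does not publish its implementation.
# It maps the quoted fine-tuning hyper-parameters onto Huggingface Transformers.
# Checkpoint name, output_dir, and num_labels are placeholders/assumptions.
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    TrainingArguments,
)

config = AutoConfig.from_pretrained(
    "bert-base-uncased",               # BERT-base: hidden size 768 ("encoder dimension")
    hidden_dropout_prob=0.1,           # dropout rate 0.1
    attention_probs_dropout_prob=0.1,  # assumption: same dropout on attention probs
    num_labels=2,                      # placeholder: depends on the GLUE task
)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)

training_args = TrainingArguments(
    output_dir="./finetune-out",  # placeholder path
    num_train_epochs=3,           # number of epochs: 3
    learning_rate=5e-5,           # learning rate: 5 × 10⁻⁵
    adam_beta1=0.9,               # Adam beta1: 0.9
    adam_beta2=0.999,             # Adam beta2: 0.999
)
```

The same settings would normally be handed to a `Trainer` together with a GLUE dataset and tokenizer; those pieces are omitted here because the paper's data pipeline is not specified.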