Learning by Interpreting

Authors: Xuting Tang, Abdul Rafae Khan, Shusen Wang, Jia Xu

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate a significant increase in accuracy of up to +3.4 BLEU points on NMT and up to +4.8 points on GLUE tasks, verifying our hypothesis that it is possible to achieve better model learning by incorporating model interpretation knowledge.
Researcher Affiliation | Collaboration | Xuting Tang¹, Abdul Rafae Khan¹, Shusen Wang², and Jia Xu¹; ¹Stevens Institute of Technology, ²Xiaohongshu Inc; {xtang18, akhan4, jxu70}@stevens.edu, shusenwang@xiaohongshu.com
Pseudocode | Yes | Algorithm 1: Attention Words/Positions Extraction
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for its methodology. It mentions using existing toolkits and libraries such as Huggingface, Fairseq, LIME, and SHAP, but not the authors' own implementation.
Open Datasets | Yes | We conduct our experiments on the large-scale WMT 2014 dataset in the German-English news domain and the medium-scale IWSLT 2017 dataset in the French-English TED Talk domain.
Dataset Splits | Yes | Both models use early stopping with a patience of 5, and the maximum number of epochs is 50. Training reaches its best validation BLEU score at iteration 10, and the corresponding test BLEU score is +0.4 higher than fine-tuning with one iteration. The system is fine-tuned once more using the best learning rate on the validation set.
Hardware Specification | Yes | The training is performed on 4 Nvidia Titan V GPUs.
Software Dependencies | No | The paper mentions software such as the Fairseq toolkit, Huggingface BERT-base, the Adam optimizer, LIME, and SHAP, but it does not provide specific version numbers for these components.
Experiment Setup | Yes | The hyper-parameters for training the ConvS2S network are: encoder and decoder embedding dimensions of 768, a learning rate of 0.25, a gradient clip norm of 0.1, a dropout ratio of 0.2, a maximum of 4000 tokens per batch, and the NAG optimizer. For the Transformer, we set the embedding dimension to 768, the learning rate to 5e-4, the warmup updates to 4000, the dropout to 0.3, the weight decay to 1e-4, the maximum tokens per batch to 5000, and the Adam optimizer [Kingma and Ba, 2014] betas to 0.9 and 0.98. The hyper-parameters for fine-tuning are as follows: the encoder dimension is 768, the number of epochs is 3, the learning rate is 5e-5, the dropout rate is 0.1, and the Adam optimizer betas are 0.9 and 0.999.
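
The "Pseudocode" row points to the paper's Algorithm 1 (Attention Words/Positions Extraction), whose exact steps are not quoted here. As a rough illustration only, the sketch below selects the top-k source positions by averaged attention weight; the function name, the averaging scheme, and the choice of k are assumptions, not the paper's procedure.

```python
import torch

def extract_attention_words(attention, tokens, k=3):
    """Illustrative sketch (not the paper's Algorithm 1): pick the k source
    tokens/positions that receive the highest attention mass.

    attention: tensor of shape (num_heads, tgt_len, src_len) for one sentence
    tokens:    list of src_len source tokens
    """
    # Average over heads and target positions -> one score per source position.
    scores = attention.mean(dim=0).mean(dim=0)          # shape: (src_len,)
    topk = torch.topk(scores, k=min(k, scores.numel()))
    positions = topk.indices.tolist()
    words = [tokens[i] for i in positions]
    return words, positions

# Toy usage with random attention weights for a 5-token source sentence.
attn = torch.rand(8, 6, 5)   # 8 heads, 6 target positions, 5 source positions
words, positions = extract_attention_words(attn, ["the", "cat", "sat", "on", "mat"])
print(words, positions)
```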
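
The "Dataset Splits" row describes early stopping on validation BLEU with a patience of 5 and a 50-epoch cap. A generic sketch of that stopping rule, assuming hypothetical `train_one_epoch` and `validate` callables rather than the paper's actual training loop:

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=50, patience=5):
    """Generic early-stopping loop: stop once validation BLEU has not improved
    for `patience` consecutive epochs, or after `max_epochs` epochs."""
    best_bleu, best_epoch, epochs_without_improvement = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        bleu = validate()
        if bleu > best_bleu:
            best_bleu, best_epoch, epochs_without_improvement = bleu, epoch, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_bleu, best_epoch
```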
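
For readers mapping the "Experiment Setup" row onto code, the sketch below shows only how the quoted Transformer and fine-tuning optimizer settings translate into plain `torch.optim.Adam` calls. The `nn.Linear` modules are stand-ins for the real networks, and everything else (warmup schedule, max-tokens batching, the ConvS2S/NAG configuration) is omitted; this illustrates the reported values, not the authors' training code.

```python
import torch
import torch.nn as nn

# Stand-in modules so the optimizers can be constructed; the actual models are
# ConvS2S / Transformer NMT systems and a BERT-base classifier in the paper.
nmt_model = nn.Linear(768, 768)
bert_classifier = nn.Linear(768, 2)

# Transformer NMT settings quoted in the "Experiment Setup" row:
# lr 5e-4, Adam betas (0.9, 0.98), weight decay 1e-4.
nmt_optimizer = torch.optim.Adam(
    nmt_model.parameters(), lr=5e-4, betas=(0.9, 0.98), weight_decay=1e-4
)

# Fine-tuning settings from the same row: lr 5e-5, Adam betas (0.9, 0.999),
# 3 epochs; the 0.1 dropout lives in the model config, not the optimizer.
finetune_optimizer = torch.optim.Adam(
    bert_classifier.parameters(), lr=5e-5, betas=(0.9, 0.999)
)
num_finetune_epochs = 3
```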