Learning by Interpreting

Authors: Xuting Tang, Abdul Rafae Khan, Shusen Wang, Jia Xu

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate a significant increase in accuracy of up to +3.4 BLEU points on NMT and up to +4.8 points on GLUE tasks, verifying our hypothesis that it is possible to achieve better model learning by incorporating model interpretation knowledge.
Researcher Affiliation | Collaboration | Xuting Tang¹, Abdul Rafae Khan¹, Shusen Wang², and Jia Xu¹; ¹Stevens Institute of Technology, ²Xiaohongshu Inc; {xtang18, akhan4, jxu70}@stevens.edu, shusenwang@xiaohongshu.com
Pseudocode | Yes | Algorithm 1: Attention Words/Positions Extraction
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for its methodology. It mentions using existing toolkits and libraries such as Huggingface, Fairseq, LIME, and SHAP, but not the authors' own implementation.
Open Datasets | Yes | We conduct our experiments on the large-scale WMT 2014 dataset in the German-English news domain and the medium-scale IWSLT 2017 dataset in the French-English TED Talk domain.
Dataset Splits | Yes | Both models use early stopping with a patience of 5, and the maximum number of epochs is 50. Training reaches its best validation BLEU score at iteration 10, and the corresponding test BLEU score is +0.4 higher than fine-tuning with one iteration. The system is fine-tuned once more using the best learning rate on the validation set.
Hardware Specification | Yes | The training is performed on 4 Nvidia Titan V GPUs.
Software Dependencies | No | The paper mentions software such as the Fairseq toolkit, Huggingface BERT-base, the Adam optimizer, LIME, and SHAP, but it does not provide specific version numbers for these components.
Experiment Setup | Yes | The hyper-parameters for training the ConvS2S network are: encoder and decoder embedding dimensions of 768, a learning rate of 0.25, a gradient clip norm of 0.1, a dropout ratio of 0.2, a maximum of 4000 tokens per batch, and the NAG optimizer. For the Transformer, we set the embedding dimension to 768, the learning rate to 5e-4, the warmup updates to 4000, the dropout to 0.3, the weight decay to 1e-4, the maximum tokens per batch to 5000, and the Adam optimizer [Kingma and Ba, 2014] betas to 0.9 and 0.98. The hyper-parameters for fine-tuning are as follows: the encoder dimension is 768, the number of epochs is 3, the learning rate is 5e-5, the dropout rate is 0.1, and the Adam optimizer betas are 0.9 and 0.999.
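
The "Pseudocode" row points to the paper's Algorithm 1 (Attention Words/Positions Extraction), whose exact steps are not quoted here. As a rough illustration only, the sketch below selects the top-k source positions by averaged attention weight; the function name, the averaging scheme, and the choice of k are assumptions, not the paper's procedure.

```python
import torch

def extract_attention_words(attention, tokens, k=3):
    """Illustrative sketch (not the paper's Algorithm 1): pick the k source
    tokens/positions that receive the highest attention mass.

    attention: tensor of shape (num_heads, tgt_len, src_len) for one sentence
    tokens:    list of src_len source tokens
    """
    # Average over heads and target positions -> one score per source position.
    scores = attention.mean(dim=0).mean(dim=0)          # shape: (src_len,)
    topk = torch.topk(scores, k=min(k, scores.numel()))
    positions = topk.indices.tolist()
    words = [tokens[i] for i in positions]
    return words, positions

# Toy usage with random attention weights for a 5-token source sentence.
attn = torch.rand(8, 6, 5)   # 8 heads, 6 target positions, 5 source positions
words, positions = extract_attention_words(attn, ["the", "cat", "sat", "on", "mat"])
print(words, positions)
```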
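
The "Dataset Splits" row describes early stopping on validation BLEU with a patience of 5 and a 50-epoch cap. A generic sketch of that stopping rule, assuming hypothetical `train_one_epoch` and `validate` callables rather than the paper's actual training loop:

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=50, patience=5):
    """Generic early-stopping loop: stop once validation BLEU has not improved
    for `patience` consecutive epochs, or after `max_epochs` epochs."""
    best_bleu, best_epoch, epochs_without_improvement = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        bleu = validate()
        if bleu > best_bleu:
            best_bleu, best_epoch, epochs_without_improvement = bleu, epoch, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_bleu, best_epoch
```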
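
For readers mapping the "Experiment Setup" row onto code, the sketch below shows only how the quoted Transformer and fine-tuning optimizer settings translate into plain `torch.optim.Adam` calls. The `nn.Linear` modules are stand-ins for the real networks, and everything else (warmup schedule, max-tokens batching, the ConvS2S/NAG configuration) is omitted; this illustrates the reported values, not the authors' training code.

```python
import torch
import torch.nn as nn

# Stand-in modules so the optimizers can be constructed; the actual models are
# ConvS2S / Transformer NMT systems and a BERT-base classifier in the paper.
nmt_model = nn.Linear(768, 768)
bert_classifier = nn.Linear(768, 2)

# Transformer NMT settings quoted in the "Experiment Setup" row:
# lr 5e-4, Adam betas (0.9, 0.98), weight decay 1e-4.
nmt_optimizer = torch.optim.Adam(
    nmt_model.parameters(), lr=5e-4, betas=(0.9, 0.98), weight_decay=1e-4
)

# Fine-tuning settings from the same row: lr 5e-5, Adam betas (0.9, 0.999),
# 3 epochs; the 0.1 dropout lives in the model config, not the optimizer.
finetune_optimizer = torch.optim.Adam(
    bert_classifier.parameters(), lr=5e-5, betas=(0.9, 0.999)
)
num_finetune_epochs = 3
```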