FusionNet: Fusing via Fully-aware Attention with Application to Machine Comprehension

Authors: Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, Weizhu Chen

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply FusionNet to the Stanford Question Answering Dataset (SQuAD) and it achieves the first position for both single and ensemble model on the official SQuAD leaderboard at the time of writing (Oct. 4th, 2017). Meanwhile, we verify the generalization of FusionNet with two adversarial SQuAD datasets and it sets up the new state-of-the-art on both datasets: on AddSent, FusionNet increases the best F1 metric from 46.6% to 51.4%; on AddOneSent, FusionNet boosts the best F1 metric from 56.0% to 60.7%.
Researcher Affiliation | Collaboration | Hsin-Yuan Huang (1,2), Chenguang Zhu (1), Yelong Shen (1), Weizhu Chen (1); (1) Microsoft Business AI and Research, (2) National Taiwan University; momohuang@gmail.com, {chezhu,yeshen,wzchen}@microsoft.com
Pseudocode | No | The paper describes the architecture and processes using diagrams and mathematical equations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | An open-source implementation of FusionNet can be found at https://github.com/momohuang/FusionNet-NLI.
Open Datasets | Yes | We focus on the SQuAD dataset (Rajpurkar et al., 2016) to train and evaluate our model. SQuAD is a popular machine comprehension dataset consisting of 100,000+ questions created by crowd workers on 536 Wikipedia articles. (See the loading sketch after this table.)
Dataset Splits | Yes | We focus on the SQuAD dataset (Rajpurkar et al., 2016) to train and evaluate our model.
Hardware Specification | Yes | On a single NVIDIA GeForce GTX Titan X GPU, each epoch took roughly 20 minutes when batch size 32 is used.
Software Dependencies | No | The paper mentions software like "PyTorch" and "spaCy" but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Detailed experimental settings can be found in Appendix E. In Appendix E, it states: "The batch size is set to 32, and the optimizer is Adamax (Kingma & Ba, 2014) with a learning rate α = 0.002, β = (0.9, 0.999) and ϵ = 10⁻⁸. A fixed random seed is used across all experiments. During training, we use a dropout rate of 0.4 (Srivastava et al., 2014) after the embedding layer (GloVe and CoVe) and before applying any linear transformation." (See the configuration sketch after this table.)
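
The SQuAD v1.1 data cited in the Open Datasets row is publicly distributed, so the standard train and dev splits can be obtained directly. The snippet below is a minimal Python sketch of doing so with the Hugging Face datasets library; that library choice and the field accesses shown are assumptions for illustration, not the authors' pipeline.

    # Minimal sketch (assumption): fetch SQuAD v1.1 via the Hugging Face `datasets` library.
    # The paper does not prescribe a loading pipeline; this only shows how to obtain the
    # standard train/dev splits used for training and evaluation.
    from datasets import load_dataset

    squad = load_dataset("squad")            # SQuAD v1.1
    train, dev = squad["train"], squad["validation"]

    example = train[0]
    print(example["question"])               # crowd-worker question
    print(example["context"][:200])          # Wikipedia passage containing the answer
    print(example["answers"]["text"][0])     # gold answer span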
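
For the Experiment Setup row, the quoted Appendix E hyperparameters (batch size 32, Adamax with α = 0.002, β = (0.9, 0.999), ϵ = 10⁻⁸, a fixed but unreported random seed, and dropout 0.4) map directly onto a PyTorch optimizer configuration. The sketch below only illustrates those reported values; the model and data are trivial placeholders, not the FusionNet architecture.

    # Hedged sketch of the reported training configuration (Appendix E).
    # Only the hyperparameter values come from the paper; the network, data,
    # and the specific seed value are placeholders for illustration.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    torch.manual_seed(0)  # "A fixed random seed is used across all experiments" (value unspecified)

    # Placeholder network; in the paper, dropout 0.4 is applied after the embedding
    # layer (GloVe and CoVe) and before any linear transformation.
    model = torch.nn.Sequential(torch.nn.Dropout(p=0.4), torch.nn.Linear(300, 2))

    optimizer = torch.optim.Adamax(
        model.parameters(),
        lr=0.002,            # α = 0.002
        betas=(0.9, 0.999),  # β = (0.9, 0.999)
        eps=1e-8,            # ϵ = 10⁻⁸
    )

    # Placeholder data; the real inputs are SQuAD passages and questions, batch size 32.
    dummy = TensorDataset(torch.randn(64, 300), torch.randint(0, 2, (64,)))
    loader = DataLoader(dummy, batch_size=32, shuffle=True)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()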