Block-Skim: Efficient Question Answering for Transformer

Authors: Yue Guan, Zhengyi Li, Zhouhan Lin, Yuhao Zhu, Jingwen Leng, Minyi Guo

AAAI 2022, pp. 10710-10719 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation, Experimental Setup: We evaluate our method on 6 extractive QA datasets, including SQuAD 1.1... Table 1 shows the result on multiple QA datasets.
Researcher Affiliation | Academia | Yue Guan (1,2), Zhengyi Li (1,2), Zhouhan Lin (1), Yuhao Zhu (3), Jingwen Leng (1,2), Minyi Guo (1,2); 1 Shanghai Jiao Tong University, 2 Shanghai Qi Zhi Institute, 3 University of Rochester
Pseudocode | No | The paper describes its method using prose and diagrams (Figure 1, Figure 4) but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | We implement the proposed method based on the open-sourced library from Wolf et al. (2019). ... The source code is available at https://github.com/ChandlerGuan/blockskim. (See the baseline-loading sketch below the table.)
Open Datasets | Yes | We evaluate our method on 6 extractive QA datasets, including SQuAD 1.1 (Rajpurkar et al. 2016), Natural Questions (Kwiatkowski et al. 2019), TriviaQA (Joshi et al. 2017), NewsQA (Trischler et al. 2016), SearchQA (Dunn et al. 2017) and HotpotQA (Yang et al. 2018). (See the dataset-loading sketch below the table.)
Dataset Splits | Yes | We evaluate our method on 6 extractive QA datasets, including SQuAD 1.1 (Rajpurkar et al. 2016)... The attention heatmaps are profiled on the development set of the SQuAD dataset with a BERT-base model...
Hardware Specification | Yes | We use four V100 GPUs with 32 GB memory for the training experiments.
Software Dependencies | No | The paper mentions using an 'open-sourced library from Wolf et al. (2019)' (Hugging Face Transformers) and 'Torch Profile (Liu 2020)', but it does not specify explicit version numbers for these software dependencies or other libraries.
Experiment Setup | Yes | We initialize the learning rate to 3e-5 for BERT models and 5e-5 for ALBERT with a linear learning rate scheduler. For the SQuAD dataset, we apply batch size 16 and maximum sequence length 384; for the other datasets, we apply batch size 32 and maximum sequence length 512. We perform all the experiments reported with random seed 42. ... The balance factor β is determined by block sample numbers and reported in Tbl. 1. The harmony factor α is 0.01 for ALBERT and 0.1 for all the other models we used. It is determined by hyper-parameter grid search from 1e-3 to 10 with a step of 10. (See the configuration sketch below the table.)
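
The Open Source Code row notes that Block-Skim is implemented on top of the open-sourced Transformers library from Wolf et al. (2019). For orientation only, the sketch below loads a stock extractive-QA model with that library; the checkpoint name and the use of AutoModelForQuestionAnswering are assumptions made here, not details taken from the paper or its repository.

```python
# Sketch only: a stock Hugging Face Transformers (Wolf et al. 2019) extractive-QA
# baseline of the kind Block-Skim builds on. The checkpoint name is an assumption;
# the QA head is randomly initialized until the model is fine-tuned on a QA dataset.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

inputs = tokenizer(
    "What does Block-Skim accelerate?",            # question
    "Block-Skim speeds up extractive QA models.",  # context passage
    return_tensors="pt",
)
outputs = model(**inputs)  # start/end logits over the input tokens
print(outputs.start_logits.shape, outputs.end_logits.shape)
```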
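The Open Datasets and Dataset Splits rows list six public extractive-QA benchmarks and mention profiling on the SQuAD development set. A minimal sketch, assuming the Hugging Face `datasets` library and its `squad` identifier for SQuAD 1.1 (the other five benchmarks have their own hub identifiers, which the paper does not list):

```python
# Minimal sketch, not from the paper: load SQuAD 1.1 and select the development
# (validation) split on which the attention heatmaps are reported to be profiled.
from datasets import load_dataset

squad = load_dataset("squad")        # SQuAD 1.1 (Rajpurkar et al. 2016)
train_set = squad["train"]           # training split
dev_set = squad["validation"]        # development split used for profiling

print(len(train_set), len(dev_set))  # roughly 87.6k train / 10.6k dev examples
print(dev_set[0]["question"])
```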
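The Experiment Setup row fixes most fine-tuning hyper-parameters. The sketch below restates them as Hugging Face TrainingArguments; the quoted text does not say whether batch size 16 is global or per GPU, so the per-device value is an assumption for the four V100s, and the output directory name is invented for illustration. The last line spells out the multiplicative grid implied by "from 1e-3 to 10 with a step of 10" for the harmony factor α.

```python
# Sketch of the quoted hyper-parameters, not the authors' released configuration.
from transformers import TrainingArguments

SQUAD_MAX_SEQ_LEN = 384   # SQuAD: batch size 16, max sequence length 384
OTHER_MAX_SEQ_LEN = 512   # other datasets: batch size 32, max sequence length 512

args = TrainingArguments(
    output_dir="blockskim-bert-squad",   # hypothetical output directory
    learning_rate=3e-5,                  # 3e-5 for BERT; 5e-5 for ALBERT
    lr_scheduler_type="linear",          # linear learning-rate scheduler
    per_device_train_batch_size=4,       # assumption: 4 per GPU x 4 V100s = 16 total
    seed=42,                             # all reported runs use random seed 42
)

# Grid for the harmony factor alpha: 1e-3 to 10 in multiplicative steps of 10;
# the selected values (0.01 for ALBERT, 0.1 for the other models) lie on this grid.
alpha_grid = [10.0 ** e for e in range(-3, 2)]   # [0.001, 0.01, 0.1, 1.0, 10.0]
```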