Block-Skim: Efficient Question Answering for Transformer
Authors: Yue Guan, Zhengyi Li, Zhouhan Lin, Yuhao Zhu, Jingwen Leng, Minyi Guo (pp. 10710-10719)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation, Experimental Setup: "We evaluate our method on 6 extractive QA datasets, including SQuAD 1.1..." Table 1 shows the result on multiple QA datasets. |
| Researcher Affiliation | Academia | Yue Guan (1,2), Zhengyi Li (1,2), Zhouhan Lin (1), Yuhao Zhu (3), Jingwen Leng (1,2), Minyi Guo (1,2); 1: Shanghai Jiao Tong University, 2: Shanghai Qi Zhi Institute, 3: University of Rochester |
| Pseudocode | No | The paper describes its method using prose and diagrams (Figure 1, Figure 4) but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | We implement the proposed method based on open-sourced library from Wolf et al. (2019). ... The source code is available at https://github.com/ChandlerGuan/blockskim. |
| Open Datasets | Yes | We evaluate our method on 6 extractive QA datasets, including SQuAD 1.1 (Rajpurkar et al. 2016), Natural Questions (Kwiatkowski et al. 2019), TriviaQA (Joshi et al. 2017), NewsQA (Trischler et al. 2016), SearchQA (Dunn et al. 2017) and HotpotQA (Yang et al. 2018). |
| Dataset Splits | Yes | We evaluate our method on 6 extractive QA datasets, including SQuAD 1.1 (Rajpurkar et al. 2016)... The attention heatmaps are profiled on the development set of the SQuAD dataset with a BERT-base model... |
| Hardware Specification | Yes | We use four V100 GPUs with 32 GB memory for the training experiments. |
| Software Dependencies | No | The paper mentions using an 'open-sourced library from Wolf et al. (2019)' (Hugging Face Transformers) and 'TorchProfile (Liu 2020)', but it does not specify explicit version numbers for these software dependencies or other libraries. |
| Experiment Setup | Yes | We initialize the learning rate to 3e-5 for BERT models and 5e-5 for ALBERT with a linear learning rate scheduler. For the SQuAD dataset, we apply batch size 16 and maximum sequence length 384. For the other datasets, we apply batch size 32 and maximum sequence length 512. We perform all the experiments reported with random seed 42. ... The balance factor β is determined by block sample numbers and reported in Tbl. 1. The harmony factor α is 0.01 for ALBERT and 0.1 for all the other models we used. It is determined by hyper-parameter grid search from 1e-3 to 10 with a step of 10. (See the configuration sketch below.) |
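
The reported hyperparameters can be expressed as a small configuration sketch. The snippet below is a minimal, hypothetical illustration using the Hugging Face `TrainingArguments` API that the paper's codebase builds on; the `output_dir` value and the per-device split of the global batch size across the four V100 GPUs are assumptions, not values stated in the paper.

```python
from transformers import TrainingArguments

# Maximum sequence length quoted for SQuAD; 512 is used for the other datasets.
MAX_SEQ_LENGTH = 384

squad_args = TrainingArguments(
    output_dir="blockskim-squad",    # hypothetical output path
    learning_rate=3e-5,              # 5e-5 for ALBERT, per the quoted setup
    per_device_train_batch_size=4,   # assumed split of the global batch size 16 over 4 GPUs
    lr_scheduler_type="linear",
    seed=42,
)

# Harmony factor alpha searched from 1e-3 to 10 with a multiplicative step of 10.
alpha_grid = [1e-3 * 10**i for i in range(5)]  # [0.001, 0.01, 0.1, 1.0, 10.0]
```

The grid produces the value range quoted above; the paper reports that the search settled on α = 0.01 for ALBERT and α = 0.1 for the other models.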