BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
Authors: Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that on five knowledge-intensive NLP tasks, BTR accelerates state-of-the-art retrieval-augmented language model inference by up to 4x and reduces storage by over 100x while maintaining over 95% task performance. |
| Researcher Affiliation | Academia | Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi Paul G. Allen School of Computer Science & Engineering University of Washington {qicao,sewon,yizhongw,hannaneh}@cs.washington.edu |
| Pseudocode | Yes | A.1 TOKEN COMPRESSION ALGORITHM Algorithm 1 Offline Compression for Binary Token Representations ... Algorithm 2 Runtime Compression (a rough binarization sketch follows the table) |
| Open Source Code | Yes | 1Our code is publicly available at https://github.com/csarron/BTR |
| Open Datasets | Yes | We evaluate BTR and baselines on three open-domain QA tasks: Natural Questions (NQ, Kwiatkowski et al. (2019)), TriviaQA (TQA, Joshi et al. (2017)), WebQuestions (WQ, Berant et al. (2013)); one fact-checking task: FEVER (Thorne et al., 2018), and one knowledge-intensive reasoning benchmark: the massive multitask language understanding (MMLU) dataset (Hendrycks et al., 2020). |
| Dataset Splits | Yes | Table 5: Statistics of the number of examples for the evaluation datasets (NQ shown here: Train 79,168; Validation 8,757; Test 3,610). |
| Hardware Specification | Yes | We conducted training using 4 to 8 A40 or A100 GPUs (depending on their availability on our cluster) with BF16 mixed precision. |
| Software Dependencies | Yes | We develop BTR based on the Atlas codebase using PyTorch 1.13.1 and Hugging Face Transformers v4.18.0 (Wolf et al., 2020). (A version-check sketch follows the table.) |
| Experiment Setup | Yes | Table 4: Training hyperparameters for BTR-Atlas (base/large per dataset). Batch size: 8/4 for NQ, TQA, WQ, and FEVER; 4/2 for MMLU. Learning rate: 6e-5 (NQ, FEVER), 4e-5 (TQA), 8e-5 (WQ), 5e-5/5e-6 (MMLU). Training steps: 20,000 (NQ, TQA), 3,000 (WQ), 10,000 (FEVER), 2,000 (MMLU). Warmup steps: 200 (50 for MMLU). Weight decay: 0.01. Number of passages: 40 (30 for MMLU). Max query length: 40 (64 for TQA, 256 for MMLU). Max passage length: 320. Max answer length: 32. (A hypothetical config transcription for the NQ base column follows the table.) |
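
The paper's actual compression procedures are given in Appendix A.1 (Algorithm 1, offline compression; Algorithm 2, runtime compression) and in the released code. As a rough illustration only, the sketch below assumes a simple sign-based binarization of precomputed token representations packed into bytes with PyTorch; the function names (`binarize_tokens`, `pack_bits`, `unpack_bits`) and the tensor shapes are our own and are not taken from the BTR codebase.

```python
import torch


def binarize_tokens(hidden: torch.Tensor) -> torch.Tensor:
    """Map continuous token representations to {0, 1} by the sign of each dimension.

    hidden: (num_tokens, dim) float tensor of precomputed passage token states.
    """
    return (hidden > 0).to(torch.uint8)


def pack_bits(bits: torch.Tensor) -> torch.Tensor:
    """Pack a (num_tokens, dim) {0, 1} tensor into (num_tokens, dim // 8) bytes for storage."""
    num_tokens, dim = bits.shape
    assert dim % 8 == 0, "dim must be divisible by 8 to pack into bytes"
    # LSB-first bit weights: bit i of each byte carries weight 2**i.
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8).view(1, 1, 8)
    return (bits.view(num_tokens, dim // 8, 8) * weights).sum(-1).to(torch.uint8)


def unpack_bits(packed: torch.Tensor, dim: int) -> torch.Tensor:
    """Recover the stored bits and map them to {-1, +1} floats for use at inference time."""
    num_tokens = packed.shape[0]
    shifts = torch.arange(8, dtype=torch.uint8)
    bits = (packed.unsqueeze(-1) >> shifts) & 1  # LSB-first, matching pack_bits
    return bits.view(num_tokens, dim).float() * 2 - 1


# Offline: encode and binarize corpus passages once, store only the packed bytes.
passage_states = torch.randn(320, 768)               # e.g. one passage, 320 tokens, 768-dim states
stored = pack_bits(binarize_tokens(passage_states))  # 320 x 96 bytes instead of 320 x 768 floats

# Runtime: load the compact bytes and decode back to +/-1 vectors for the reader.
decoded = unpack_bits(stored, dim=768)
print(stored.shape, decoded.shape)                   # torch.Size([320, 96]) torch.Size([320, 768])
```

Packing each 768-dimensional float32 vector into 96 bytes shows where the storage savings of binary token representations come from; the exact offline and runtime algorithms and the reported numbers should be taken from Appendix A.1 and the repository.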
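
Because the reported dependency versions are specific (PyTorch 1.13.1, Transformers v4.18.0), a reproduction attempt can start with a small version sanity check such as the one below; this snippet is our own and is not part of the BTR repository.

```python
# Hypothetical environment check for the dependency versions reported above
# (PyTorch 1.13.1, Hugging Face Transformers v4.18.0); not from the BTR repository.
import torch
import transformers

expected = {"torch": "1.13.1", "transformers": "4.18.0"}
found = {"torch": torch.__version__, "transformers": transformers.__version__}

for package, version in expected.items():
    if found[package].startswith(version):
        print(f"{package} {found[package]} matches the reported version")
    else:
        print(f"warning: {package} is {found[package]}, the paper reports {version}")
```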
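
To make the flattened Table 4 easier to scan, the snippet below transcribes one of its columns (BTR-Atlas base on Natural Questions) into a plain Python dictionary. The key names are illustrative and are not guaranteed to match the Atlas/BTR training flags.

```python
# Table 4 values for BTR-Atlas base on Natural Questions, transcribed for readability.
# Key names are illustrative; they do not necessarily match the Atlas/BTR command-line flags.
btr_atlas_base_nq = {
    "batch_size": 8,            # 4 for the large model
    "learning_rate": 6e-5,
    "training_steps": 20_000,
    "warmup_steps": 200,
    "weight_decay": 0.01,
    "num_passages": 40,         # retrieved passages per query
    "max_query_length": 40,     # in tokens (64 for TQA, 256 for MMLU)
    "max_passage_length": 320,  # in tokens
    "max_answer_length": 32,    # in tokens
}

for name, value in btr_atlas_base_nq.items():
    print(f"{name}: {value}")
```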