Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation
Authors: Fangyuan Xu, Weijia Shi, Eunsol Choi
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on language modeling task and open domain question answering task. We achieve a compression rate of as low as 6% with minimal loss in performance for both tasks, significantly outperforming the off-the-shelf summarization models. |
| Researcher Affiliation | Academia | Fangyuan Xu1, Weijia Shi2, Eunsol Choi1 Department of Computer Science 1The University of Texas at Austin, 2University of Washington EMAIL, EMAIL |
| Pseudocode | Yes | Figure 2: Learning an extractive compressor for language modeling task. Figure 3: Learning an abstractive compressor for language modeling task. |
| Open Source Code | Yes | Our code is available at https://github.com/carriex/recomp. |
| Open Datasets | Yes | For the language modeling task, we generate training data using the training split of the Wikitext-103 dataset... Natural Questions (NQ) (Kwiatkowski et al., 2019), Trivia QA (Joshi et al., 2017) and Hotpot QA (Yang et al., 2018). |
| Dataset Splits | Yes | We report results on development set of NQ, test set of Trivia QA and randomly sampled 500 examples from Hotpot QA development set. Table 5: Training data statistics for abstractive and extractive compressors. NQ Train 42,149 Validation 9,769, TQA Train 70,032 Validation 8,753, Hotpot QA Train 24,526 Validation 3,068, Wikitext Train 1,398,318 Validation 1,5483. |
| Hardware Specification | Yes | We run FLAN-UL2 on 4 A40 GPUs. For compression, we run contriver and T5 on a single A40 GPU (Table 6). |
| Software Dependencies | No | The paper mentions 'Transformers', 'sentence-transformer library', 'spaCy', and 'NLTK' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We train with Adam optimizer (Kingma & Ba, 2014), using a batch size of 64, learning rate of 2e-5 and 1000 warmup steps for 3 epochs. We train abstractive summarizer with Adam optimizer (Kingma & Ba, 2014), using a batch size of 16, learning rate of 1e-5 and 1000 warmup steps for 3 epochs. |
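
The two training configurations quoted in the Experiment Setup row can be written down as a small config sketch. This is illustrative only and is not taken from the paper's repository: the class name, field names, and the `linear_warmup_lr` helper are hypothetical, the "extractive" label for the first configuration is an assumption (the quote does not name the model), and the linear warmup shape followed by a constant rate is also an assumption, since the quote states only the number of warmup steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressorTrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    optimizer: str
    batch_size: int
    learning_rate: float
    warmup_steps: int
    epochs: int


# First quoted configuration; labeling it "extractive" is an assumption,
# because the quoted sentence does not say which compressor it trains.
EXTRACTIVE = CompressorTrainConfig("adam", 64, 2e-5, 1000, 3)

# Second quoted configuration, stated for the abstractive summarizer.
ABSTRACTIVE = CompressorTrainConfig("adam", 16, 1e-5, 1000, 3)


def linear_warmup_lr(cfg: CompressorTrainConfig, step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumes linear warmup to the peak rate, then a constant rate.
    The quote does not specify the warmup shape or any decay.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * ((step + 1) / cfg.warmup_steps)
    return cfg.learning_rate


if __name__ == "__main__":
    for name, cfg in (("extractive", EXTRACTIVE), ("abstractive", ABSTRACTIVE)):
        print(name, cfg, linear_warmup_lr(cfg, 0), linear_warmup_lr(cfg, cfg.warmup_steps))
```

Keeping the values in a frozen dataclass makes it straightforward to compare a rerun's settings against the quoted ones; anything not in the quote (weight decay, scheduler after warmup, random seed) would still need to be taken from the paper or its code.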