Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
Authors: Sheryl Hsu, Omar Khattab, Chelsea Finn, Archit Sharma
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allows it to be applied to arbitrary off-the-shelf retrievers and makes it a promising technique for improving general LLM pipelines. Project website: http://sherylhsu.com/LeReT/. |
| Researcher Affiliation | Collaboration | 1Stanford University, 2Databricks, 3Physical Intelligence, 4Google DeepMind |
| Pseudocode | Yes | Algorithm 1 Prompt Driven Diverse Sampling + Training |
| Open Source Code | No | Project website: http://sherylhsu.com/LeReT/. |
| Open Datasets | Yes | We test LeReT on HotpotQA (Yang et al., 2018) and HoVer (Jiang et al., 2020). |
| Dataset Splits | Yes | We test LeReT on HotpotQA (Yang et al., 2018) and HoVer (Jiang et al., 2020). Both datasets are based on a Wikipedia knowledge base and are multi-hop, meaning that models must reason across multiple articles to arrive at the correct answer. The datasets provide both the correct answer and supporting articles. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | We implement our sampling pipeline on top of DSPy (Khattab et al., 2023), specifically defining a single hop as a program and sampling data using the evaluate functions. |
| Experiment Setup | Yes | We use a learning rate of 1e-7 for SFT/context distillation in all our experiments, and use a τ = 0.05 and learning rate of 1e-7. We train SFT for 1 epoch, and we only distill the best performing prompt. We train IPO for 2 epochs. |
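For readers reconstructing the setup, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. This is a minimal sketch: the key names (`sft`, `ipo`, `learning_rate`, `tau`, etc.) are our own labels, not identifiers from the paper or its codebase.

```python
# Hedged sketch of the training configuration quoted above.
# Values come from the paper's reported setup; key names are assumptions.
TRAINING_CONFIG = {
    "sft": {
        "learning_rate": 1e-7,            # LR for SFT / context distillation
        "epochs": 1,                      # SFT trained for 1 epoch
        "distill": "best_prompt_only",    # only the best-performing prompt is distilled
    },
    "ipo": {
        "learning_rate": 1e-7,            # LR for IPO preference training
        "tau": 0.05,                      # IPO regularization strength tau
        "epochs": 2,                      # IPO trained for 2 epochs
    },
}

if __name__ == "__main__":
    for stage, params in TRAINING_CONFIG.items():
        print(stage, params)
```

Note that both stages share the same learning rate (1e-7); only the epoch count and the IPO-specific τ differ between them.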