Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
P-Adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts
Authors: Benjamin Newman, Prafulla Kumar Choubey, Nazneen Rajani
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | P-Adapters show between 12-26% absolute improvement in precision and 36-50% absolute improvement in consistency over a baseline of only using natural language queries. Additionally, we investigate Mixture of Experts (MoE) models that learn a set of continuous prompts ("experts") and select one to query the LLM. |
| Researcher Affiliation | Collaboration | Benjamin Newman (Stanford University), Prafulla Kumar Choubey (Salesforce Research), Nazneen Rajani (Salesforce Research). Work conducted during internship at Salesforce Research. |
| Pseudocode | No | The paper describes model architectures and procedures but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To encourage the use of P-Adapters to effectively extract factual information, we release the code used to train them (https://github.com/salesforce/factlm). |
| Open Datasets | Yes | We use the entity pairs and relations from the T-REx split of the LAMA work (Elsahar et al., 2018; Petroni et al., 2019) in our experiments. This data is used for evaluation. For training and validation, we use separate sets of entity pairs for each relation collected by Shin et al. (2020), which they use to optimize their discrete prompts. The templates we use are pooled from prior work: the LAMA, LPAQA, and ParaRel datasets (Jiang et al., 2020; Elazar et al., 2021). |
| Dataset Splits | Yes | For training and validation, we use separate sets of entity pairs for each relation collected by Shin et al. (2020), which they use to optimize their discrete prompts. We split the templates into two equal-sized groups: one for training and one for OOD Prompt evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments, such as specific GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions software components like the 'Adam optimizer,' 'AdamW optimizer,' 'Hugging Face Transformers,' and the 'nlpaug package,' but it does not specify exact version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | All of our P-Adapters were trained using the hyperparameters from Liu et al. (2021b): Adam optimizer with a learning rate of 1e-5, weight decay of 5e-4, a batch size of 128, and an exponential learning rate decay schedule with a decay rate of 0.98 (Kingma & Ba, 2015). Our MoE classifiers were trained using an AdamW optimizer with a learning rate of 0.001 and linear learning rate decay (Loshchilov & Hutter, 2018). We use Hugging Face Transformers to train the model for 3 epochs on the same training data used to train the P-Adapter models (Wolf et al., 2020). |
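The optimizer and scheduler settings quoted in the Experiment Setup row can be sketched in PyTorch. This is a minimal illustration of those hyperparameters only: the `torch.nn.Linear` modules are placeholders (the actual P-Adapter and MoE classifier architectures are not reproduced here), and the step count for the linear decay is an assumption, since the paper's quoted text does not state it.

```python
import torch

# Placeholder module standing in for the P-Adapter parameters.
p_adapter = torch.nn.Linear(8, 8)

# P-Adapter training (per Liu et al., 2021b, as quoted above):
# Adam, lr 1e-5, weight decay 5e-4, exponential lr decay at rate 0.98.
opt = torch.optim.Adam(p_adapter.parameters(), lr=1e-5, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.98)

# Placeholder module standing in for the MoE prompt classifier.
moe_clf = torch.nn.Linear(8, 2)

# MoE classifier training: AdamW, lr 0.001, linear lr decay.
# total_steps is a hypothetical value (3 epochs x an assumed 100 steps).
clf_opt = torch.optim.AdamW(moe_clf.parameters(), lr=1e-3)
total_steps = 3 * 100
clf_sched = torch.optim.lr_scheduler.LambdaLR(
    clf_opt, lambda step: max(0.0, 1.0 - step / total_steps))
```

In a training loop, `sched.step()` would be called once per epoch (multiplying the learning rate by 0.98), while `clf_sched.step()` would be called per optimizer step to walk the MoE classifier's learning rate linearly down to zero.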