Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Interpreting Language Reward Models via Contrastive Explanations
Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the effectiveness of our method for generating high-quality contrastive explanations. Our experiments are conducted on three open-source human preference datasets and three RMs. For each dataset, we randomly select 30 binary comparisons from the training set to serve as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. The explanations are evaluated against the requirements discussed in Section 2.3 using popular metrics from the text CF literature (Nguyen et al., 2024). |
| Researcher Affiliation | Collaboration | Imperial College London; J.P. Morgan AI Research; {firstname.surname}@jpmorgan.com |
| Pseudocode | No | The paper describes methods using natural language and a visual overview in Figure 2, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for their implementation. |
| Open Datasets | Yes | We use HelpSteer2 (HS2) (Wang et al., 2024b), HH-RLHF-helpful, and HH-RLHF-harmless (Bai et al., 2022). |
| Dataset Splits | No | For each dataset, we randomly select 30 binary comparisons from the training set to serve as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, or cloud computing instances with specifications) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions several software components and models like GPT-4o, Sentence-BERT, and Polyjuice (Wu et al., 2021), but it does not specify explicit version numbers for these or other key software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | In our experiments, Y+ and Y− both contain 15 perturbed responses, each associated with one attribute from the following list: avoid-to-answer, appropriateness, assertiveness, clarity, coherence, complexity, correctness, engagement, harmlessness, helpfulness, informativeness, neutrality, relevance, sensitivity, verbosity. [...] For each dataset, we randomly select 30 binary comparisons from the training set to serve as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. |
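
For concreteness, below is a minimal Python sketch of the sampling protocol quoted in the Research Type, Dataset Splits, and Experiment Setup rows. The loader, seed values, and placeholder pair format are hypothetical illustrations, not from the paper; only the dataset names, the 30-comparisons-per-seed count, the five repetitions, and the attribute list come from the quoted excerpts.

```python
import random

# Hypothetical stand-in for loading a dataset's training split; the paper does
# not specify a data-loading API, so this fabricates placeholder preference pairs.
def load_training_comparisons(dataset_name):
    return [(f"{dataset_name}-chosen-{i}", f"{dataset_name}-rejected-{i}")
            for i in range(1000)]

DATASETS = ["HelpSteer2", "HH-RLHF-helpful", "HH-RLHF-harmless"]

# The 15 attributes listed in the paper, one per perturbed response in Y+ / Y−.
ATTRIBUTES = [
    "avoid-to-answer", "appropriateness", "assertiveness", "clarity",
    "coherence", "complexity", "correctness", "engagement", "harmlessness",
    "helpfulness", "informativeness", "neutrality", "relevance",
    "sensitivity", "verbosity",
]

COMPARISONS_PER_SEED = 30  # binary comparisons sampled per dataset per seed
NUM_SEEDS = 5              # repetitions: 30 * 5 = 150 test comparisons per dataset

for dataset_name in DATASETS:
    pool = load_training_comparisons(dataset_name)
    test_comparisons = []
    for seed in range(NUM_SEEDS):
        rng = random.Random(seed)  # a fresh random seed for each repetition
        test_comparisons.extend(rng.sample(pool, COMPARISONS_PER_SEED))
    # Contrastive explanations would then be generated for each comparison and
    # each RM, and evaluated with text-CF metrics (Nguyen et al., 2024).
    print(dataset_name, len(test_comparisons))  # -> 150 per dataset
```

Note that because sampling is restricted to the training set, this protocol defines no train/validation/test split, which is consistent with the "No" classification for Dataset Splits above.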