Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TED by measuring how well the failures it uncovers predict downstream behavior in two settings: output-editing and inference-steering. ... We include the full quantitative results in Table 1, and find that for nearly every failure type, semantic thesaurus, and model, TED's average success rate is always higher than the semantic-only baseline, and is frequently much higher. |
| Researcher Affiliation | Academia | Erik Jones, Arjun Patrawala, & Jacob Steinhardt, UC Berkeley EMAIL |
| Pseudocode | No | No, the paper describes the method "THESAURUS ERROR DETECTION (TED)" in Section 3 and its instantiation in Section 4 using descriptive text and mathematical formulations (e.g., Equation 1), but it does not present a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code is available at https://github.com/arjunpat/thesaurus-error-detector |
| Open Datasets | Yes | The exhaustive list of ethical questions is made available in the code |
| Dataset Splits | Yes | To minimize overlap between training and test datasets, we find it effective to prompt GPT-4 to generate 200 ethical questions, saving 100 for training semantic embeddings and 100 for testing them in the output-editing failures test. |
| Hardware Specification | Yes | Inference occurs on single A100 40 GB with a temperature = 1, while gradients are computed on an 80 GB A100. |
| Software Dependencies | No | No, the paper mentions using "vLLM" and "Hugging Face transformers library (Wolf et al., 2019)" and "PyTorch" but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We average n = 100 prompts to construct the embeddings, and set τsim = 0.93 and τdis = 0.1 for Mistral on the unexpected edits and inadequate updates respectively. ... For Llama 3 we set τsim = 0.98 and τdis = 0.5. ... Inference occurs on single A100 40 GB with a temperature = 1 |
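
The Experiment Setup row quotes two steps: averaging over n = 100 prompts to build each embedding, and comparing against the thresholds τsim and τdis. The following is a minimal Python sketch of that kind of computation, not the authors' implementation. The function names, the use of cosine similarity, and the rule that a pair counts as "similar" above τsim and "dissimilar" below τdis are all assumptions made for illustration. Only the threshold values (those reported for Mistral) come from the table.

```python
import math
from typing import List, Sequence

# Threshold values quoted in the table for Mistral.
# How they are applied below is an assumption, not taken from the paper.
TAU_SIM = 0.93
TAU_DIS = 0.1


def mean_embedding(per_prompt_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average one vector per prompt into a single embedding (hypothetical helper)."""
    n = len(per_prompt_vectors)
    dim = len(per_prompt_vectors[0])
    return [sum(vec[i] for vec in per_prompt_vectors) / n for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_pair(similarity: float,
                  tau_sim: float = TAU_SIM,
                  tau_dis: float = TAU_DIS) -> str:
    """Label a pair by comparing its similarity to the two thresholds (assumed rule)."""
    if similarity > tau_sim:
        return "similar"
    if similarity < tau_dis:
        return "dissimilar"
    return "neither"


if __name__ == "__main__":
    # Toy two-dimensional vectors standing in for per-prompt embeddings.
    phrase_a = mean_embedding([[1.0, 0.0], [1.0, 0.2]])
    phrase_b = mean_embedding([[0.0, 1.0], [0.0, 3.0]])
    print(classify_pair(cosine_similarity(phrase_a, phrase_a)))
    print(classify_pair(cosine_similarity(phrase_a, phrase_b)))
```

With the Llama 3 values from the same row, the call would instead pass `tau_sim=0.98` and `tau_dis=0.5`.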