Assessing Large Language Models on Climate Information
Authors: Jannis Bulian, Mike S. Schäfer, Afra Amini, Heidi Lam, Massimiliano Ciaramita, Ben Gaiarin, Michelle Chen Huebscher, Christian Buck, Niels G. Mede, Markus Leippold, Nadine Strauss
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. ... We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication. |
| Researcher Affiliation | Collaboration | (1) Google DeepMind; (2) IKMZ Dept. of Communication and Media Research, University of Zurich, Switzerland; (3) ETH AI Center, ETH Zurich, Switzerland; (4) Google; (5) Dept. of Finance, University of Zurich, Switzerland. |
| Pseudocode | No | Figure 1. Overview of the evaluation pipeline as described in Section 3. |
| Open Source Code | No | To aid reproducibility, we provide the exact evaluation protocols and all prompts used to generate data. |
| Open Datasets | No | The final question set used in the evaluations consists of 300 questions, i.e., 100 questions from each source. To aid reproducibility, we provide the exact evaluation protocols and all prompts used to generate data. |
| Dataset Splits | No | We test our evaluative dimensions in a human rating study. The rating task involves evaluating an answer based on the presentational (Section 2.1) and epistemological dimensions (Section 2.2). ... Each answer is assessed by three human raters. |
| Hardware Specification | No | We evaluate the following models: GPT-4 (OpenAI, 2023), ChatGPT-3.5, InstructGPT (turbo), InstructGPT (text-davinci-003), InstructGPT (text-davinci-002), as well as PaLM 2 (text-bison) (Anil et al., 2023) and Falcon-180B-Chat. This data was collected in the months of September and October 2023. |
| Software Dependencies | No (see the probing sketch below the table) | For consistency, we always use GPT-4 for this purpose. ... We use the same in-context probing approach as above and train a logistic regression classifier on top of flan-xxl representations. |
| Experiment Setup | Yes (see the answer-sampling sketch below the table) | To obtain answers we use a simple prompt consisting of the instruction: "You are an expert on climate change communication. Answer each question in a 3-4 sentence paragraph." We include the answer length information to anchor the expected response to an objective value. ... For each dimension, we ask the model to express its agreement or disagreement that the information is presented well according to that dimension. ... We sample 3 responses (temperature 0.6) from GPT-4 for each question to replicate the setup we have with human raters. |
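The answer-generation step quoted in the Experiment Setup row is concrete enough to sketch: an expert-persona system instruction, answers of 3-4 sentences, and 3 samples per question at temperature 0.6. The sketch below is a minimal illustration, not the authors' code; it assumes the OpenAI Python SDK (v1), and the function name and example question are hypothetical. Only the instruction string, model name, temperature, and sample count come from the quoted setup.

```python
# Minimal sketch of the answer-generation step, assuming the OpenAI Python
# SDK (v1). The instruction text, model, temperature, and 3 samples per
# question come from the quoted setup; everything else is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "You are an expert on climate change communication. "
    "Answer each question in a 3-4 sentence paragraph."
)

def sample_answers(question: str, n_samples: int = 3) -> list[str]:
    """Draw several independent answers to one climate question."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": question},
        ],
        temperature=0.6,  # sampling temperature quoted in the setup
        n=n_samples,      # three samples mirror the three human raters
    )
    return [choice.message.content for choice in response.choices]

# Hypothetical usage with one question from the evaluation set.
answers = sample_answers("How do we know recent warming is human-caused?")
```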
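The probing step quoted in the Software Dependencies row ("train a logistic regression classifier on top of flan-xxl representations") can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the Hugging Face checkpoint google/flan-t5-xxl, mean pooling over encoder states, and scikit-learn's LogisticRegression as the probe are all assumptions, and the labeled examples are placeholders.

```python
# Minimal sketch of a probing classifier on frozen encoder representations.
# Assumptions: checkpoint "google/flan-t5-xxl", mean-pooled encoder states
# as features, and scikit-learn's LogisticRegression as the probe.
import torch
from transformers import AutoTokenizer, T5EncoderModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl")
encoder.eval()

def embed(texts: list[str]):
    """Mean-pool the encoder's last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (batch, seq, dim)
    mask = batch.attention_mask.unsqueeze(-1)            # zero out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Placeholder labeled data: answer texts paired with a binary rating label.
train_texts = ["An answer rated positively ...", "An answer rated negatively ..."]
train_labels = [1, 0]

probe = LogisticRegression(max_iter=1000)
probe.fit(embed(train_texts), train_labels)
print(probe.predict(embed(["A new answer to score ..."])))
```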