Assessing Large Language Models on Climate Information

Authors: Jannis Bulian, Mike S. Schäfer, Afra Amini, Heidi Lam, Massimiliano Ciaramita, Ben Gaiarin, Michelle Chen Huebscher, Christian Buck, Niels G. Mede, Markus Leippold, Nadine Strauss

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. ... We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.
Researcher Affiliation | Collaboration | 1 Google DeepMind; 2 IKMZ Dept. of Communication and Media Research, University of Zurich, Switzerland; 3 ETH AI Center, ETH Zurich, Switzerland; 4 Google; 5 Dept. of Finance, University of Zurich, Switzerland.
Pseudocode | No | Figure 1. Overview of the evaluation pipeline as described in Section 3.
Open Source Code | No | To aid reproducibility, we provide the exact evaluation protocols and all prompts used to generate data.
Open Datasets | No | The final question set used in the evaluations consists of 300 questions, i.e., 100 questions from each source. To aid reproducibility, we provide the exact evaluation protocols and all prompts used to generate data.
Dataset Splits | No | We test our evaluative dimensions in a human rating study. The rating task involves evaluating an answer based on the presentational (Section 2.1) and epistemological dimensions (Section 2.2). ... Each answer is assessed by three human raters.
Hardware Specification | No | We evaluate the following models: GPT-4 (OpenAI, 2023), ChatGPT-3.5, InstructGPT (turbo), InstructGPT (text-davinci-003), InstructGPT (text-davinci-002), as well as PaLM 2 (text-bison) (Anil et al., 2023) and Falcon-180B-Chat. This data was collected in the months of September and October 2023.
Software Dependencies | No | For consistency, we always use GPT-4 for this purpose. ... We use the same in-context probing approach as above and train a logistic regression classifier on top of flan-xxl representations.
Experiment Setup | Yes | To obtain answers we use a simple prompt consisting of the instruction: "You are an expert on climate change communication. Answer each question in a 3-4 sentence paragraph." We include the answer length information to anchor the expected response to an objective value. ... For each dimension, we ask the model to express its agreement or disagreement that the information is presented well according to that dimension. ... We sample 3 responses (temperature 0.6) from GPT-4 for each question to replicate the setup we have with human raters.
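
The Experiment Setup row above describes two steps: answers are obtained with a short expert instruction, and GPT-4 is then asked, for each evaluative dimension, to agree or disagree that the information is presented well, with 3 responses sampled at temperature 0.6 to mirror the three human raters. The sketch below illustrates that loop under stated assumptions; the instruction text, sample count, and temperature come from the excerpt, while the client code, the model name "gpt-4", and the rating-prompt wording are assumptions rather than the authors' released tooling.

# Minimal sketch of answer generation and AI rating, assuming the current
# openai Python client. The rating prompt is a hypothetical paraphrase of the
# agree/disagree question described in the paper excerpt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ANSWER_INSTRUCTION = (
    "You are an expert on climate change communication. "
    "Answer each question in a 3-4 sentence paragraph."
)

def get_answer(question: str) -> str:
    """Obtain one answer to a climate question using the expert instruction."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": ANSWER_INSTRUCTION},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

def rate_answer(question: str, answer: str, dimension: str) -> list[str]:
    """Sample 3 agree/disagree judgements at temperature 0.6 for one dimension,
    mirroring the three human raters in the paper's setup."""
    rating_prompt = (
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Do you agree or disagree that the information in this answer is "
        f"presented well with respect to the following dimension: {dimension}?"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": rating_prompt}],
        n=3,
        temperature=0.6,
    )
    return [choice.message.content for choice in response.choices]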
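
The Software Dependencies row mentions training a logistic regression classifier on top of flan-xxl representations. The sketch below shows one plausible way to build such a probe; the checkpoint google/flan-t5-xxl, the mean-pooling of encoder hidden states, and the placeholder texts and labels are assumptions not specified in the excerpt.

# Hedged sketch of a probing classifier over encoder representations.
import torch
from transformers import AutoTokenizer, T5EncoderModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xxl")

def embed(texts):
    """Mean-pool the final encoder hidden states into one vector per text."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state                 # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()        # (B, T, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # masked mean
    return pooled.numpy()

# Placeholder probing data: (answer text, binary label for one evaluative dimension).
train_texts = ["Example answer A ...", "Example answer B ..."]
train_labels = [1, 0]

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_texts), train_labels)
print(clf.predict(embed(["A new answer to score ..."])))

A frozen encoder with a linear probe keeps the classifier cheap to train and attributes any detection ability to the pretrained representations rather than to task-specific fine-tuning, which is the usual rationale for probing setups of this kind.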