Large Language Models are Geographically Biased

Authors: Rohin Manvi, Samar Khanna, Marshall Burke, David B. Lobell, Stefano Ermon

ICML 2024

Reproducibility assessment (variable, result, and supporting evidence from the paper):
Research Type: Experimental. Evidence: "We show various problematic geographic biases, which we define as systemic errors in geospatial predictions. Initially, we demonstrate that LLMs are capable of making accurate zero-shot geospatial predictions in the form of ratings that show strong monotonic correlation with ground truth (Spearman's ρ of up to 0.89). We then show that LLMs exhibit common biases across a range of objective and subjective topics. In particular, LLMs are clearly biased against locations with lower socioeconomic conditions (e.g. most of Africa) on a variety of sensitive subjective topics such as attractiveness, morality, and intelligence (Spearman's ρ of up to 0.70). Finally, we introduce a bias score to quantify this and find that there is significant variation in the magnitude of bias across existing LLMs."
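The monotonic-correlation metric quoted above (Spearman's ρ between LLM ratings and ground truth) can be computed with a standard rank correlation. The sketch below is a pure-Python implementation with made-up ratings and ground-truth values, not the paper's data or code:

```python
def rank(values):
    # Assign 1-based ranks, averaging ranks over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

ratings = [2.1, 5.0, 6.7, 8.3, 9.0]       # hypothetical LLM ratings
truth = [10.0, 30.0, 45.0, 80.0, 95.0]    # hypothetical ground truth
print(round(spearman_rho(ratings, truth), 2))  # perfectly monotonic -> 1.0
```

Because Spearman's ρ only compares rankings, it rewards any monotonic relationship between ratings and ground truth, which is why the paper can use it even though the LLM's rating scale is arbitrary.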
Researcher Affiliation: Academia. Evidence: "Rohin Manvi¹, Samar Khanna¹, Marshall Burke¹, David Lobell¹, Stefano Ermon¹ ... ¹Stanford University. Correspondence to: Rohin Manvi <rohinm@cs.stanford.edu>."
Pseudocode: No. The paper describes its methods in prose and does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code: Yes. Evidence: "Code is available on the project website: https://rohinmanvi.github.io/GeoLLM"
Open Datasets: Yes. Evidence: "This includes Infant Mortality Rate (CIESIN, 2021), Population Density (Tatem, 2017), Built-Up to Non-Built-Up Area Ratio (JRC & CIESIN, 2021; Florczyk et al., 2019), Nighttime Light Intensity (Elvidge et al., 2017), Average Temperature (Karger et al., 2018), and Annual Precipitation (Karger et al., 2018)."
Dataset Splits: No. The paper evaluates LLMs in a zero-shot setting and collects 2000 prompts for evaluation, stating: "To visualize an LLM's ratings on a global scale, we select 2000 prompts aiming for a good balance between relevant locations as well as good geographical coverage." However, it does not specify explicit training, validation, or test splits in the traditional sense, as the models being evaluated are pre-trained and used directly without further training within this study.
Hardware Specification: No. The paper names the specific LLMs used (e.g. "GPT-4 Turbo (gpt-4-1106-preview)", "Gemini Pro", "Mixtral 8x7B") but does not specify any hardware details (e.g. GPU models, CPU types, memory) used to run the experiments or interact with these models.
Software Dependencies: No. The paper names specific LLM models and versions, such as "GPT-4 Turbo (gpt-4-1106-preview)" and "GPT-3.5 Turbo (gpt-3.5-turbo-0613)", and references OpenAI's API for logprobs. However, it does not provide version numbers for ancillary software dependencies such as programming languages (e.g. Python), libraries (e.g. PyTorch, TensorFlow), or other frameworks used for data processing, analysis, or interaction beyond the LLM APIs themselves.
Experiment Setup: Yes. Evidence: "The prompt consists of a prefix with three sentences that describe the task and a GeoLLM prompt that provides spatial context for the respective coordinates as well as the name of the topic and rating scale. An example of a prompt is shown in Figure 2. ... The first method is to simply get the most probable rating. Since there are only 3 tokens total required for a rating (e.g. '6.7') with the first token (first digit) being the most important, greedy sampling (temperature of 0.0) likely leads to the most probable rating. ... To visualize an LLM's ratings on a global scale, we select 2000 prompts aiming for a good balance between relevant locations as well as good geographical coverage."
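The rating-extraction step described above — greedy decoding of a short numeric rating, with a logprob-based expectation over the first digit as an alternative — can be sketched roughly as follows. The parsing regex and the logprob dictionary here are illustrative assumptions, not the paper's code; real logprobs would come from an API such as OpenAI's:

```python
import math
import re

def parse_rating(text):
    # Greedy sampling (temperature 0.0) yields the single most probable
    # completion; here we pull the first number like "6.7" out of it.
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def expected_first_digit(digit_logprobs):
    # Alternative: take the expectation over the first token's digit
    # distribution, using a hypothetical {digit: logprob} mapping
    # (renormalized in case the returned top-k probs do not sum to 1).
    probs = {d: math.exp(lp) for d, lp in digit_logprobs.items()}
    total = sum(probs.values())
    return sum(int(d) * p for d, p in probs.items()) / total

print(parse_rating("6.7"))  # -> 6.7
print(expected_first_digit({"6": math.log(0.6), "7": math.log(0.4)}))
```

The expectation variant smooths over the model's uncertainty about the leading digit, while greedy parsing simply commits to the single most probable rating string.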