Large Language Models are Geographically Biased

Authors: Rohin Manvi, Samar Khanna, Marshall Burke, David B. Lobell, Stefano Ermon

ICML 2024

Reproducibility assessment (variable, result, and supporting evidence from the paper):
Research Type: Experimental. Evidence: "We show various problematic geographic biases, which we define as systemic errors in geospatial predictions. Initially, we demonstrate that LLMs are capable of making accurate zero-shot geospatial predictions in the form of ratings that show strong monotonic correlation with ground truth (Spearman's ρ of up to 0.89). We then show that LLMs exhibit common biases across a range of objective and subjective topics. In particular, LLMs are clearly biased against locations with lower socioeconomic conditions (e.g. most of Africa) on a variety of sensitive subjective topics such as attractiveness, morality, and intelligence (Spearman's ρ of up to 0.70). Finally, we introduce a bias score to quantify this and find that there is significant variation in the magnitude of bias across existing LLMs."
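The monotonic-correlation metric quoted above (Spearman's ρ between LLM ratings and ground truth) can be computed with a standard rank correlation. The sketch below is a pure-Python implementation with made-up ratings and ground-truth values, not the paper's data or code:

```python
def rank(values):
    # Assign 1-based ranks, averaging ranks over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

ratings = [2.1, 5.0, 6.7, 8.3, 9.0]       # hypothetical LLM ratings
truth = [10.0, 30.0, 45.0, 80.0, 95.0]    # hypothetical ground truth
print(round(spearman_rho(ratings, truth), 2))  # perfectly monotonic -> 1.0
```

Because Spearman's ρ only compares rankings, it rewards any monotonic relationship between ratings and ground truth, which is why the paper can use it even though the LLM's rating scale is arbitrary.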
Researcher Affiliation: Academia. Evidence: "Rohin Manvi¹, Samar Khanna¹, Marshall Burke¹, David Lobell¹, Stefano Ermon¹ ... ¹Stanford University. Correspondence to: Rohin Manvi <rohinm@cs.stanford.edu>."
Pseudocode: No. The paper describes its methods in prose and does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code: Yes. Evidence: "Code is available on the project website: https://rohinmanvi.github.io/GeoLLM"
Open Datasets: Yes. Evidence: "This includes Infant Mortality Rate (CIESIN, 2021), Population Density (Tatem, 2017), Built-Up to Non-Built-Up Area Ratio (JRC & CIESIN, 2021; Florczyk et al., 2019), Nighttime Light Intensity (Elvidge et al., 2017), Average Temperature (Karger et al., 2018), and Annual Precipitation (Karger et al., 2018)."
Dataset Splits: No. The paper evaluates LLMs in a zero-shot setting and collects 2000 prompts for evaluation, stating: "To visualize an LLM's ratings on a global scale, we select 2000 prompts aiming for a good balance between relevant locations as well as good geographical coverage." However, it does not specify explicit training, validation, or test splits in the traditional sense, as the models being evaluated are pre-trained and used directly without further training within this study.
Hardware Specification: No. The paper names the specific LLMs used (e.g. "GPT-4 Turbo (gpt-4-1106-preview)", "Gemini Pro", "Mixtral 8x7B") but does not specify any hardware details (e.g. GPU models, CPU types, memory) used to run the experiments or interact with these models.
Software Dependencies: No. The paper names specific LLM models and versions, such as "GPT-4 Turbo (gpt-4-1106-preview)" and "GPT-3.5 Turbo (gpt-3.5-turbo-0613)", and references OpenAI's API for logprobs. However, it does not provide version numbers for ancillary software dependencies such as programming languages (e.g. Python), libraries (e.g. PyTorch, TensorFlow), or other frameworks used for data processing, analysis, or interaction beyond the LLM APIs themselves.
Experiment Setup: Yes. Evidence: "The prompt consists of a prefix with three sentences that describe the task and a GeoLLM prompt that provides spatial context for the respective coordinates as well as the name of the topic and rating scale. An example of a prompt is shown in Figure 2. ... The first method is to simply get the most probable rating. Since there are only 3 tokens total required for a rating (e.g. '6.7') with the first token (first digit) being the most important, greedy sampling (temperature of 0.0) likely leads to the most probable rating. ... To visualize an LLM's ratings on a global scale, we select 2000 prompts aiming for a good balance between relevant locations as well as good geographical coverage."
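The rating-extraction step described above — greedy decoding of a short numeric rating, with a logprob-based expectation over the first digit as an alternative — can be sketched roughly as follows. The parsing regex and the logprob dictionary here are illustrative assumptions, not the paper's code; real logprobs would come from an API such as OpenAI's:

```python
import math
import re

def parse_rating(text):
    # Greedy sampling (temperature 0.0) yields the single most probable
    # completion; here we pull the first number like "6.7" out of it.
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def expected_first_digit(digit_logprobs):
    # Alternative: take the expectation over the first token's digit
    # distribution, using a hypothetical {digit: logprob} mapping
    # (renormalized in case the returned top-k probs do not sum to 1).
    probs = {d: math.exp(lp) for d, lp in digit_logprobs.items()}
    total = sum(probs.values())
    return sum(int(d) * p for d, p in probs.items()) / total

print(parse_rating("6.7"))  # -> 6.7
print(expected_first_digit({"6": math.log(0.6), "7": math.log(0.4)}))
```

The expectation variant smooths over the model's uncertainty about the leading digit, while greedy parsing simply commits to the single most probable rating string.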