Approaching Human-Level Forecasting with Language Models

Authors: Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

NeurIPS 2024

Reproducibility assessment. Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response (a quoted excerpt from, or summary of, the paper).

Research Type: Experimental
LLM Response: "To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it."
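
The excerpt above does not name the scoring rule used for the comparison against human forecasts. A standard metric for binary forecasting questions, and an assumption here rather than something stated in the excerpt, is the Brier score. A minimal sketch with illustrative numbers:

    import numpy as np

    def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
        # Mean squared error between forecast probabilities and 0/1 outcomes.
        # Lower is better; a constant 0.5 forecast scores 0.25.
        return float(np.mean((probs - outcomes) ** 2))

    # Hypothetical forecasts; none of these numbers come from the paper.
    system = np.array([0.82, 0.10, 0.55])  # system's probability of "yes"
    crowd = np.array([0.75, 0.05, 0.60])   # human crowd aggregate
    truth = np.array([1, 0, 1])            # resolved outcomes

    print(brier_score(system, truth), brier_score(crowd, truth))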

Researcher Affiliation: Academia
LLM Response: "Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, Jacob Steinhardt. UC Berkeley. {dhalawi, z0, john0922ucb, jsteinhardt}@berkeley.edu"

Pseudocode: No
LLM Response: The paper includes diagrams (e.g., Figure 1) to illustrate the system's architecture, but it does not contain any formal pseudocode blocks or algorithms labeled as such.

Open Source Code: Yes
LLM Response: "We also attach the code and data in our submission."

Open Datasets: Yes
LLM Response: "To facilitate further research, we release our dataset: the largest and most recent forecasting dataset compiled from 5 real-world forecasting competitions."

Dataset Splits: Yes
LLM Response: "This yields a set of 5,516 binary questions, including 3,762 for training, 840 for validation, and 914 for testing (Table 2a)."
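
Because the test questions were published after the LMs' knowledge cut-offs (see Research Type above), the split is temporal rather than random. A minimal sketch of such a split; the record schema and cut-off date are illustrative assumptions, not the paper's:

    from datetime import date

    # Hypothetical question records; field names are assumed for illustration.
    questions = [
        {"id": 1, "open_date": date(2021, 5, 1)},
        {"id": 2, "open_date": date(2023, 2, 10)},
        {"id": 3, "open_date": date(2023, 9, 30)},
    ]

    CUTOFF = date(2023, 6, 1)  # illustrative knowledge cut-off

    # Questions opened after the cut-off form the test set; the rest are
    # divided between training and validation.
    test = [q for q in questions if q["open_date"] > CUTOFF]
    train_val = [q for q in questions if q["open_date"] <= CUTOFF]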

Hardware Specification: No
LLM Response: The paper mentions accessing specific language models via API (e.g., Together AI's serving API), which implies cloud infrastructure, but beyond stating API costs it does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud instance types) used for the experiments or model inference.

Software Dependencies: Yes
LLM Response: "We evaluate 14 instruction-tuned LMs: GPT-3.5-Turbo-1106, GPT-3.5-Turbo-Instruct (Brown et al., 2020); GPT-4-0613, GPT-4-1106-Preview (OpenAI, 2023); Llama-2-7B, Llama-2-13B, Llama-2-70B (Touvron et al., 2023); Mistral-7B-Instruct, Mixtral-8x7B-Instruct (Jiang et al., 2024); Nous Hermes 2 Mixtral-8x7B-DPO; Yi-34B-Chat; Claude-2, Claude-2.1 (Anthropic, 2023); and Gemini-Pro (Gemini Team, 2023)."
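
The models above are served through hosted APIs rather than run locally. A minimal sketch of querying one of the listed OpenAI models through the official Python client; the prompt is an illustrative assumption, and the paper's actual prompting setup is not shown in the excerpt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # one of the 14 evaluated models
        messages=[
            {
                "role": "user",
                "content": "Will event X resolve yes by 2025? Reply with a probability.",
            }
        ],
    )
    print(response.choices[0].message.content)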

Experiment Setup: Yes
LLM Response: "To create a set of retrieval dates for each question, we use geometrically increasing time points between the open and close dates. We use n = 5 retrieval dates per question; the k-th retrieval date is given by retrieval_date_k = date_begin + (date_close - date_begin - 1)^(k/n)."
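
Read literally, the formula spaces the retrieval offsets geometrically in days. A minimal sketch of that reading; treating the date difference as a whole number of days is an assumption:

    from datetime import date, timedelta

    def retrieval_dates(date_begin: date, date_close: date, n: int = 5) -> list[date]:
        # retrieval_date_k = date_begin + (date_close - date_begin - 1)^(k/n)
        # for k = 1..n; fractional days are dropped by date arithmetic.
        span = (date_close - date_begin).days - 1
        return [date_begin + timedelta(days=span ** (k / n)) for k in range(1, n + 1)]

    # Example: a question open from 2023-06-01 to 2023-09-01.
    for d in retrieval_dates(date(2023, 6, 1), date(2023, 9, 1)):
        print(d)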