Approaching Human-Level Forecasting with Language Models
Authors: Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. |
| Researcher Affiliation | Academia | Danny Halawi* Fred Zhang* Chen Yueh-Han* Jacob Steinhardt UC Berkeley {dhalawi, z0, john0922ucb, jsteinhardt}@berkeley.edu |
| Pseudocode | No | The paper includes diagrams (e.g., Figure 1) to illustrate the system's architecture, but it does not contain any formal pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | We also attach the code and data in our submission. |
| Open Datasets | Yes | To facilitate further research, we release our dataset: the largest and most recent forecasting dataset compiled from 5 real-world forecasting competitions. |
| Dataset Splits | Yes | This yields a set of 5,516 binary questions, including 3,762 for training, 840 for validation, and 914 for testing (Table 2a). |
| Hardware Specification | No | The paper mentions using specific language models via API access (e.g., Together AI's serving API), which implies cloud infrastructure, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or detailed cloud instance specifications) used for running the experiments or model inferences beyond stating API costs. |
| Software Dependencies | Yes | We evaluate 14 instruction-tuned LMs: GPT-3.5-Turbo-1106, GPT-3.5-Turbo-Instruct (Brown et al., 2020); GPT-4-0613, GPT-4-1106-Preview (OpenAI, 2023); Llama-2-7B, Llama-2-13B, Llama-2-70B (Touvron et al., 2023); Mistral-7B-Instruct, Mixtral-8x7B-Instruct (Jiang et al., 2024), Nous Hermes 2 Mixtral-8x7B-DPO, Yi-34B-Chat, Claude-2, Claude-2.1 (Anthropic, 2023), and Gemini-Pro (Gemini Team, 2023). |
| Experiment Setup | Yes | To create a set of retrieval dates for each question, we use geometrically increasing time points between the open and close dates. We use n = 5 retrieval dates per question; the k-th retrieval date is given by `retrieval_date_k = date_begin + (date_close − date_begin − 1)^(k/n)`. |
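
The retrieval-date schedule quoted in the Experiment Setup row translates directly into code. Below is a minimal Python sketch of that geometric schedule, assuming day-granularity dates and the reconstructed exponent form `(date_close − date_begin − 1)^(k/n)`; the function name and the rounding of fractional day offsets to whole days are illustrative assumptions, not details from the paper.

```python
from datetime import date, timedelta

def retrieval_dates(date_begin: date, date_close: date, n: int = 5) -> list[date]:
    """Geometrically spaced retrieval dates between a question's open and close dates.

    Sketch of retrieval_date_k = date_begin + (date_close - date_begin - 1)^(k/n)
    for k = 1..n. Rounding fractional day offsets to whole days is an assumption,
    not something the paper specifies.
    """
    span = (date_close - date_begin).days - 1  # largest offset: one day before close
    return [date_begin + timedelta(days=round(span ** (k / n))) for k in range(1, n + 1)]

# Example: a question open from Jan 1 to Apr 1, 2023 (span of 89 days)
print(retrieval_dates(date(2023, 1, 1), date(2023, 4, 1)))
# [2023-01-03, 2023-01-07, 2023-01-16, 2023-02-06, 2023-03-31]
```

Note a property of the formula itself: the schedule front-loads retrieval dates near the question's open date and spaces them progressively farther apart, with the last date landing just before the close date.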