Approaching Human-Level Forecasting with Language Models

Authors: Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

NeurIPS 2024

Reproducibility assessment. Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response (a quoted excerpt from, or summary of, the paper).

Research Type: Experimental
LLM Response: "To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it."
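
The excerpt above does not name the scoring rule used for the comparison against human forecasts. A standard metric for binary forecasting questions, and an assumption here rather than something stated in the excerpt, is the Brier score. A minimal sketch with illustrative numbers:

    import numpy as np

    def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
        # Mean squared error between forecast probabilities and 0/1 outcomes.
        # Lower is better; a constant 0.5 forecast scores 0.25.
        return float(np.mean((probs - outcomes) ** 2))

    # Hypothetical forecasts; none of these numbers come from the paper.
    system = np.array([0.82, 0.10, 0.55])  # system's probability of "yes"
    crowd = np.array([0.75, 0.05, 0.60])   # human crowd aggregate
    truth = np.array([1, 0, 1])            # resolved outcomes

    print(brier_score(system, truth), brier_score(crowd, truth))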

Researcher Affiliation: Academia
LLM Response: "Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, Jacob Steinhardt. UC Berkeley. {dhalawi, z0, john0922ucb, jsteinhardt}@berkeley.edu"

Pseudocode: No
LLM Response: The paper includes diagrams (e.g., Figure 1) to illustrate the system's architecture, but it does not contain any formal pseudocode blocks or algorithms labeled as such.

Open Source Code: Yes
LLM Response: "We also attach the code and data in our submission."

Open Datasets: Yes
LLM Response: "To facilitate further research, we release our dataset: the largest and most recent forecasting dataset compiled from 5 real-world forecasting competitions."

Dataset Splits: Yes
LLM Response: "This yields a set of 5,516 binary questions, including 3,762 for training, 840 for validation, and 914 for testing (Table 2a)."
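
Because the test questions were published after the LMs' knowledge cut-offs (see Research Type above), the split is temporal rather than random. A minimal sketch of such a split; the record schema and cut-off date are illustrative assumptions, not the paper's:

    from datetime import date

    # Hypothetical question records; field names are assumed for illustration.
    questions = [
        {"id": 1, "open_date": date(2021, 5, 1)},
        {"id": 2, "open_date": date(2023, 2, 10)},
        {"id": 3, "open_date": date(2023, 9, 30)},
    ]

    CUTOFF = date(2023, 6, 1)  # illustrative knowledge cut-off

    # Questions opened after the cut-off form the test set; the rest are
    # divided between training and validation.
    test = [q for q in questions if q["open_date"] > CUTOFF]
    train_val = [q for q in questions if q["open_date"] <= CUTOFF]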

Hardware Specification: No
LLM Response: The paper mentions accessing specific language models via API (e.g., Together AI's serving API), which implies cloud infrastructure, but beyond stating API costs it does not specify the underlying hardware (e.g., GPU models, CPU types, or cloud instance types) used for the experiments or model inference.

Software Dependencies: Yes
LLM Response: "We evaluate 14 instruction-tuned LMs: GPT-3.5-Turbo-1106, GPT-3.5-Turbo-Instruct (Brown et al., 2020); GPT-4-0613, GPT-4-1106-Preview (OpenAI, 2023); Llama-2-7B, Llama-2-13B, Llama-2-70B (Touvron et al., 2023); Mistral-7B-Instruct, Mixtral-8x7B-Instruct (Jiang et al., 2024); Nous Hermes 2 Mixtral-8x7B-DPO; Yi-34B-Chat; Claude-2, Claude-2.1 (Anthropic, 2023); and Gemini-Pro (Gemini Team, 2023)."
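
The models above are served through hosted APIs rather than run locally. A minimal sketch of querying one of the listed OpenAI models through the official Python client; the prompt is an illustrative assumption, and the paper's actual prompting setup is not shown in the excerpt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # one of the 14 evaluated models
        messages=[
            {
                "role": "user",
                "content": "Will event X resolve yes by 2025? Reply with a probability.",
            }
        ],
    )
    print(response.choices[0].message.content)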

Experiment Setup: Yes
LLM Response: "To create a set of retrieval dates for each question, we use geometrically increasing time points between the open and close dates. We use n = 5 retrieval dates per question; the k-th retrieval date is given by retrieval_date_k = date_begin + (date_close - date_begin - 1)^(k/n)."
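
Read literally, the formula spaces the retrieval offsets geometrically in days. A minimal sketch of that reading; treating the date difference as a whole number of days is an assumption:

    from datetime import date, timedelta

    def retrieval_dates(date_begin: date, date_close: date, n: int = 5) -> list[date]:
        # retrieval_date_k = date_begin + (date_close - date_begin - 1)^(k/n)
        # for k = 1..n; fractional days are dropped by date arithmetic.
        span = (date_close - date_begin).days - 1
        return [date_begin + timedelta(days=span ** (k / n)) for k in range(1, n + 1)]

    # Example: a question open from 2023-06-01 to 2023-09-01.
    for d in retrieval_dates(date(2023, 6, 1), date(2023, 9, 1)):
        print(d)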