Position: Data-driven Discovery with Large Generative Models

Authors: Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, Peter Clark

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Then, through DATAVOYAGER, a proof-of-concept utilizing GPT-4, we demonstrate how LGMs fulfill several of these desiderata, a feat previously unattainable, while also highlighting important limitations in the current system that open up opportunities for novel ML research.
Researcher Affiliation | Collaboration | 1 Allen Institute for AI; 2 Open Locus; 3 University of Massachusetts Amherst; 4 University of Utah. Correspondence to: Bodhisattwa Prasad Majumder <bodhisattwam@allenai.org>, Harshit Surana <harshit@openlocus.ai>.
Pseudocode | No | The paper includes code snippets within the text (e.g., Python code for statistical tests) and diagrams illustrating system architecture, but it does not present any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not explicitly state that source code for the DATAVOYAGER system will be released, nor does it link to a code repository for its methodology.
Open Datasets | Yes | For example, Smith et al. (2005) explored the link between time preference and BMI from the National Longitudinal Surveys using several variables indicating the saving behavior of the respondents. ... National Longitudinal Survey of Youth data with a question on how incarceration and race affected wealth was fed to DATAVOYAGER; it is a question studied in (Zaw et al., 2016).
Dataset Splits | No | The paper does not report split percentages, sample counts, or a partitioning methodology (e.g., training, validation, and test sets) that would allow the data splits to be reproduced.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions statistical analysis tools and Python for code generation, but it does not list the specific software libraries, dependencies, or version numbers required to replicate the experiments.
Experiment Setup | No | The paper does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or specific training configurations.