Can Language Models Solve Graph Problems in Natural Language?

Authors: Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, Yulia Tsvetkov

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LLMs (GPT-3/4) with various prompting approaches on the NLGraph benchmark and find that 1) language models do demonstrate preliminary graph reasoning abilities...
Researcher Affiliation | Academia | Xi'an Jiaotong University, University of Washington, University of Notre Dame
Pseudocode | No | The paper provides examples of prompts and algorithmic reasoning in Figure 5, but these are illustrative of the prompting approach, not structured pseudocode or algorithm blocks for the methods themselves.
Open Source Code | Yes | The NLGraph benchmark and evaluation code are available at https://github.com/Arthur-Heng/NLGraph.
Open Datasets | Yes | To this end, we propose the Natural Language Graph (NLGraph) benchmark, a comprehensive testbed of graph and structured reasoning designed for language models and in natural language. NLGraph contains a total of 29,370 problems... The NLGraph benchmark and evaluation code are available at https://github.com/Arthur-Heng/NLGraph.
Dataset Splits | No | The paper divides problems into 'easy, medium, and hard subsets' for fine-grained analysis and evaluates on a 'standard set' versus an 'extended version'. However, it does not specify explicit training/validation/test splits with percentages or counts for model development or hyperparameter tuning; it mainly evaluates pre-existing LLMs with various prompting techniques.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., specific GPU or CPU models, or detailed cloud computing instances). It only mentions using various Large Language Models (LLMs) such as TEXT-DAVINCI-003, GPT-3.5-TURBO, CODE-DAVINCI-002, and GPT-4.
Software Dependencies | No | The paper mentions various LLMs (TEXT-DAVINCI-003, GPT-3.5-TURBO, CODE-DAVINCI-002, GPT-4, OPT-2.7B) and prompting techniques but does not specify any software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | For all baselines except self-consistency, we set temperature τ = 0; for self-consistency prompting, we sample five chain-of-thought responses with temperature τ = 0.7. For few-shot prompting techniques... the input prompt includes k exemplars... For the connectivity task and cycle task, we set k to 4; for the GNN task, we set k to 1 due to the context size limit, while for other tasks k is 5.
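
As a concrete illustration of that setup, the decoding configuration could be wired up roughly as in the sketch below. This is a minimal sketch rather than the authors' released evaluation code: call_llm and extract_answer are hypothetical placeholders for the completion client and answer parser, and only the temperature values, the five-sample self-consistency vote, and the per-task exemplar counts come from the paper.

```python
from collections import Counter

# Hypothetical helper: wraps whatever completion API is used (e.g., the
# endpoint serving TEXT-DAVINCI-003 or GPT-4). Not part of the NLGraph repo.
def call_llm(prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in your LLM client here")

# Hypothetical answer parser; the real one depends on each task's output format.
def extract_answer(response: str) -> str:
    return response.strip().splitlines()[-1]

# Exemplar counts per task as reported in the paper; all other tasks use k = 5.
K_EXEMPLARS = {"connectivity": 4, "cycle": 4, "GNN": 1}

def build_prompt(exemplars: list[str], question: str) -> str:
    # Few-shot prompt: k worked exemplars followed by the target question.
    return "\n\n".join(exemplars + [question])

def run_baseline(prompt: str) -> str:
    # All baselines except self-consistency: greedy decoding (temperature 0).
    return call_llm(prompt, temperature=0.0)

def run_self_consistency(prompt: str, n_samples: int = 5) -> str:
    # Self-consistency: sample several chain-of-thought responses at tau = 0.7
    # and take a majority vote over the extracted final answers.
    answers = [extract_answer(call_llm(prompt, temperature=0.7))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

With a real client plugged into call_llm, run_baseline covers the τ = 0 baselines and run_self_consistency reproduces the five-sample majority vote described above.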