ALPINE: Unveiling The Planning Capability of Autoregressive Learning in Language Models

Authors: Siwei Wang, Yifei Shen, Shi Feng, Haoran Sun, Shang-Hua Teng, Wei Chen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We initiate a theoretical investigation into the emergence of planning capabilities in Transformer-based LLMs via their next-word prediction mechanisms. We model planning as a network path-finding task, where the objective is to generate a valid path from a specified source node to a designated target node. Our mathematical characterization shows that Transformer architectures can execute path-finding by embedding the adjacency and reachability matrices within their weights. Furthermore, our theoretical analysis of gradient-based learning dynamics reveals that LLMs can learn both the adjacency matrix and a limited form of the reachability matrix. These theoretical insights are then validated through experiments, which demonstrate that Transformer architectures indeed learn the adjacency matrix and an incomplete reachability matrix, consistent with our theoretical predictions. When applying our methodology to the real-world planning benchmark Blocksworld, our observations remain consistent.
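To ground the two objects this characterization revolves around, here is a minimal Python sketch (not taken from the paper) that builds an adjacency matrix A and a reachability matrix R for a toy five-node DAG; the graph, the numpy-based transitive closure, and the convention R[t, k] = 1 iff node t is reachable from node k (matching the R(t, k) test in Algorithm 1 below) are all illustrative assumptions.

import numpy as np

# Toy DAG on 5 nodes with edges 0->1, 0->2, 1->3, 2->3, 3->4 (illustrative only).
n = 5
A = np.zeros((n, n), dtype=int)        # adjacency matrix: A[i, j] = 1 iff edge i -> j exists
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    A[i, j] = 1

# Transitive closure: reach[k, t] = 1 iff there is a directed path (length >= 1) from k to t.
reach = A.copy()
for _ in range(n):
    reach = ((reach + reach @ A) > 0).astype(int)

# Reachability matrix indexed as in Algorithm 1: R[t, k] = 1 iff node t is reachable from node k.
R = reach.T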
Researcher Affiliation | Collaboration | Siwei Wang (1), Yifei Shen (1), Shi Feng (2), Haoran Sun (3), Shang-Hua Teng (4), Wei Chen (1); (1) Microsoft Research Asia ({siweiwang, yifeishen, weic}@microsoft.com), (2) Harvard University (shifeng@fas.harvard.edu), (3) Peking University (sunhaoran0301@stu.pku.edu.cn), (4) University of Southern California (shanghua@usc.edu)
Pseudocode | Yes | Algorithm 1: A handcrafted path-finding algorithm
1: Input: adjacency matrix A, reachability matrix R, source node s, target node t
2: Set path P = [s, t, s] and set current node i = s
3: while i ≠ t do
4: Obtain S = {k | A(i, k) = 1 and R(t, k) = 1}
5: Randomly sample next node k from S
6: Append k to path P, and set i = k
7: end while
8: Output path P
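Below is a hedged, runnable Python rendering of Algorithm 1, reusing the toy A and R above; the 0-indexed nodes and the convention that the target itself passes the reachability test (so the loop can terminate) are assumptions of this sketch, not details taken from the authors' released code.

import random

def find_path(A, R, s, t, rng=random):
    # Sequence starts "source, target, source", as in line 2 of Algorithm 1.
    path = [s, t, s]
    i = s
    while i != t:
        # S = {k | A(i, k) = 1 and R(t, k) = 1}; k == t is allowed here so the walk can finish.
        S = [k for k in range(A.shape[0]) if A[i, k] == 1 and (k == t or R[t, k] == 1)]
        if not S:
            raise ValueError(f"stuck at node {i}: no out-neighbour can reach {t}")
        i = rng.choice(S)           # randomly sample the next node from S
        path.append(i)
    return path

# Example with the toy graph above: find_path(A, R, s=0, t=4)
# returns either [0, 4, 0, 1, 3, 4] or [0, 4, 0, 2, 3, 4].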
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code is uploaded in the supplementary material.
Open Datasets | No | The paper states that the datasets used for experiments were 'generated from a directed graph G = (V, E)' (Section 2.2) and 'The DAG is generated randomly' (Section 4), or 'we construct a graph G_BW' (Appendix F). While the methodology for generating the data is described and presumably available via the supplementary code, the datasets themselves are not presented as pre-existing, publicly available benchmarks with direct links or formal citations to download them as distinct entities.
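For concreteness, one way the described generation process could look is sketched below; the forward-edge construction, the edge probability, and the helper names are assumptions, since this row only establishes that the DAG is generated randomly rather than downloaded.

import numpy as np

rng = np.random.default_rng(0)

def random_dag(n, edge_prob=0.1):
    # Keep only "forward" edges i -> j with i < j over a fixed node ordering, so the graph is acyclic.
    A = (rng.random((n, n)) < edge_prob).astype(int)
    return np.triu(A, k=1)

def reachable_pairs(A):
    # All ordered pairs (s, t) joined by a directed path of length >= 1.
    n = A.shape[0]
    reach = A.copy()
    for _ in range(n):
        reach = ((reach + reach @ A) > 0).astype(int)
    return [(s, t) for s in range(n) for t in range(n) if reach[s, t] == 1]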
Dataset Splits | No | Section 4 states: 'Then these reachable pairs are separated into the training set (w.p. 0.5) and the test set (w.p. 0.5)'. Appendix F mentions: 'We randomly select 80% of all node pairs for training and the rest 20% for testing'. The paper clearly defines training and testing splits but does not explicitly mention or quantify a separate validation set split percentage for the main experiments.
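A minimal sketch of the quoted splitting rule follows; the independent per-pair Bernoulli assignment and the seed are illustrative, and the commented usage assumes the random_dag / reachable_pairs helpers from the sketch above.

import random

random.seed(0)

def split_pairs(pairs, train_frac=0.5):
    # Section 4 uses a 0.5/0.5 split of reachable pairs; Appendix F uses 0.8/0.2 over node pairs.
    train, test = [], []
    for p in pairs:
        (train if random.random() < train_frac else test).append(p)
    return train, test

# e.g. train_pairs, test_pairs = split_pairs(reachable_pairs(random_dag(100)), train_frac=0.5)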
Hardware Specification | Yes | In this section, we state the complete experimental results on the synthetic datasets. All these results are conducted on a single A100 GPU.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). While it uses a Transformer architecture and implicitly relies on deep learning frameworks, no specific versions are provided for reproducibility beyond the general mention of a 'standard GPT architecture'.
Experiment Setup | Yes | In our experiments, we employ Transformer models with an embedding size of d = 120. We conduct tests using various configurations, ranging from 1-layer, 1-head to 6-layer, 6-head, while considering different graph sizes, with the number of nodes n ranging from 100 to 500.
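The reported sweep can be written down as a configuration grid, as in the sketch below; the nanoGPT-style field names and the 100-node step for graph sizes are assumptions, since the paper states only the ranges and the embedding size d = 120.

from dataclasses import dataclass
from itertools import product

@dataclass
class ModelConfig:
    n_layer: int
    n_head: int
    n_embd: int = 120    # embedding size d = 120; divisible by every head count from 1 to 6
    n_nodes: int = 100   # number of nodes n in the graph

# 1-layer/1-head up to 6-layer/6-head Transformers, on graphs of 100 to 500 nodes.
configs = [
    ModelConfig(n_layer=l, n_head=h, n_nodes=n)
    for l, h, n in product(range(1, 7), range(1, 7), range(100, 501, 100))
]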