Faith and Fate: Limits of Transformers on Compositionality

Authors: Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang (Lorraine) Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations' performance can rapidly decay with increased task complexity.
Researcher Affiliation | Collaboration | Nouha Dziri (1), Ximing Lu (1,2), Melanie Sclar (2), Xiang Lorraine Li (1), Liwei Jiang (1,2), Bill Yuchen Lin (1), Peter West (1,2), Chandra Bhagavatula (1), Ronan Le Bras (1), Jena D. Hwang (1), Soumya Sanyal (3), Sean Welleck (1,2), Xiang Ren (1,3), Allyson Ettinger (1,4), Zaid Harchaoui (1,2), Yejin Choi (1,2); affiliations: (1) Allen Institute for Artificial Intelligence, (2) University of Washington, (3) University of Southern California, (4) University of Chicago
Pseudocode | Yes | Algorithm 1: Puzzle Solver
Open Source Code | Yes | Code and data are available at https://github.com/nouhadziri/faith-and-fate
Open Datasets | Yes | We exhaustively generate multiplication problems as question-answer pairs (e.g., Q: What is 4 times 32? A: 128). We exhaustively generate data for the DP task. For the question-answer setting, we include a thorough explanation of the task before asking the model to generate a solution.
Dataset Splits | Yes | In multiplication and DP, we finetune models with all enumerations of questions up to the maximum problem size within a reasonable training budget, leaving out 10% for validation and 10% for testing (an illustrative generation-and-split sketch follows this table).
Hardware Specification | No | The paper states that evaluations were conducted "using the OpenAI API" and mentions fine-tuning GPT-3, but it does not specify the underlying hardware (e.g., GPU models, CPU types) used for these operations.
Software Dependencies | No | The paper names specific models, such as GPT-4 (gpt-4) [58], ChatGPT (gpt-3.5-turbo) [57], GPT-3 (text-davinci-003) [11], Flan-T5 [17], and LLaMA [75], but does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For fine-tuning on question-answer pairs, we train the model separately for {14, 12, 4} epochs for multiplication, puzzle, and DP, respectively, saving the best model based on the validation set. For training on question-scratchpad pairs, we train the model for {16, 8, 2} epochs for multiplication, puzzle, and DP. The batch size is set to approximately 0.2% of the number of examples in the training set. For the learning rate multiplier, we experimented with values ranging from 0.02 to 0.2 to find the best-performing setting and chose 0.2. During inference, we set nucleus sampling p to 0.7 and temperature to 1 (a hedged API configuration sketch follows this table).
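
The Open Datasets and Dataset Splits rows describe exhaustively enumerating multiplication question-answer pairs and holding out 10% for validation and 10% for testing. The following is a minimal sketch of that procedure in plain Python; the operand range, digit cap, prompt wording, file names, and random seed are illustrative assumptions, not details taken from the paper.

```python
import itertools
import json
import random

def generate_multiplication_qa(max_digits=2):
    """Exhaustively enumerate multiplication question-answer pairs for all
    operand pairs with up to `max_digits` digits each (digit cap is assumed)."""
    upper = 10 ** max_digits
    for a, b in itertools.product(range(2, upper), repeat=2):
        yield {"prompt": f"Q: What is {a} times {b}? A:", "completion": f" {a * b}"}

def split_80_10_10(examples, seed=0):
    """Hold out 10% for validation and 10% for testing; the rest is for training."""
    random.Random(seed).shuffle(examples)
    n_val = n_test = len(examples) // 10
    val = examples[:n_val]
    test = examples[n_val:n_val + n_test]
    train = examples[n_val + n_test:]
    return train, val, test

if __name__ == "__main__":
    data = list(generate_multiplication_qa(max_digits=2))
    train, val, test = split_80_10_10(data)
    for name, rows in [("train", train), ("val", val), ("test", test)]:
        with open(f"multiplication_{name}.jsonl", "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
```

The prompt/completion JSONL layout is chosen to match the legacy OpenAI fine-tuning format used in the next sketch; a scratchpad variant of the data would place the step-by-step solution in the completion rather than the bare answer.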
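
The Experiment Setup row maps onto the legacy OpenAI fine-tunes endpoint (the openai Python SDK before v1.0), which was the standard route for fine-tuning GPT-3-class models at the time. The sketch below is not the authors' released code: the base model name, file names, and max_tokens value are assumptions, while the epoch count (14 for multiplication question-answer pairs), the batch size of roughly 0.2% of the training examples, the learning rate multiplier of 0.2, and the inference settings (top_p = 0.7, temperature = 1) come from the row above.

```python
import os
import openai  # legacy SDK (openai<1.0), assumed here for the text-davinci-003 era

openai.api_key = os.environ["OPENAI_API_KEY"]

# Upload the prepared JSONL files (file names are illustrative).
train_file = openai.File.create(file=open("multiplication_train.jsonl", "rb"), purpose="fine-tune")
val_file = openai.File.create(file=open("multiplication_val.jsonl", "rb"), purpose="fine-tune")
n_train = sum(1 for _ in open("multiplication_train.jsonl"))

# Fine-tune with the reported hyperparameters: 14 epochs for multiplication
# question-answer pairs, batch size ~0.2% of the training set, and a
# learning rate multiplier of 0.2.
job = openai.FineTune.create(
    training_file=train_file["id"],
    validation_file=val_file["id"],
    model="davinci",  # assumed base model for the legacy endpoint
    n_epochs=14,
    batch_size=max(1, round(0.002 * n_train)),
    learning_rate_multiplier=0.2,
)

# Poll until the job status is "succeeded" (omitted), then fetch the model name.
job = openai.FineTune.retrieve(id=job["id"])
fine_tuned_model = job["fine_tuned_model"]

# Inference with nucleus sampling p = 0.7 and temperature = 1.
response = openai.Completion.create(
    model=fine_tuned_model,
    prompt="Q: What is 4 times 32? A:",
    max_tokens=16,  # assumed; short numeric answers need few tokens
    temperature=1,
    top_p=0.7,
)
print(response["choices"][0]["text"].strip())
```

A batch size of roughly 0.2% of the training set was also the legacy endpoint's documented default, so the paper's setting likely corresponds to leaving that hyperparameter at its default value.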