Faith and Fate: Limits of Transformers on Compositionality

Authors: Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang (Lorraine) Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations' performance can rapidly decay with increased task complexity.
Researcher Affiliation | Collaboration | Nouha Dziri (1), Ximing Lu (1,2), Melanie Sclar (2), Xiang Lorraine Li (1), Liwei Jiang (1,2), Bill Yuchen Lin (1), Peter West (1,2), Chandra Bhagavatula (1), Ronan Le Bras (1), Jena D. Hwang (1), Soumya Sanyal (3), Sean Welleck (1,2), Xiang Ren (1,3), Allyson Ettinger (1,4), Zaid Harchaoui (1,2), Yejin Choi (1,2); affiliations: (1) Allen Institute for Artificial Intelligence, (2) University of Washington, (3) University of Southern California, (4) University of Chicago
Pseudocode | Yes | Algorithm 1: Puzzle Solver
Open Source Code | Yes | Code and data are available at https://github.com/nouhadziri/faith-and-fate
Open Datasets | Yes | We exhaustively generate multiplication problems as question-answer pairs (e.g., Q: What is 4 times 32? A: 128). We exhaustively generate data for the DP task. For the question-answer setting, we include a thorough explanation of the task before asking the model to generate a solution.
Dataset Splits | Yes | In multiplication and DP, we finetune models with all enumerations of questions up to the maximum problem size within a reasonable training budget, leaving out 10% for validation and 10% for testing (an illustrative generation-and-split sketch follows this table).
Hardware Specification | No | The paper states that evaluations were conducted "using the OpenAI API" and mentions fine-tuning GPT-3, but it does not specify the underlying hardware (e.g., GPU models, CPU types) used for these operations.
Software Dependencies | No | The paper names specific models, such as GPT-4 (gpt-4) [58], ChatGPT (gpt-3.5-turbo) [57], GPT-3 (text-davinci-003) [11], Flan-T5 [17], and LLaMA [75], but does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For fine-tuning on question-answer pairs, we train the model separately for {14, 12, 4} epochs for multiplication, puzzle, and DP, respectively, saving the best model based on the validation set. For training on question-scratchpad pairs, we train the model for {16, 8, 2} epochs for multiplication, puzzle, and DP. The batch size is set to approximately 0.2% of the number of examples in the training set. For the learning rate multiplier, we experimented with values ranging from 0.02 to 0.2 to find the best-performing setting and chose 0.2. During inference, we set nucleus sampling p to 0.7 and temperature to 1 (a hedged API configuration sketch follows this table).
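
The Open Datasets and Dataset Splits rows describe exhaustively enumerating multiplication question-answer pairs and holding out 10% for validation and 10% for testing. The following is a minimal sketch of that procedure in plain Python; the operand range, digit cap, prompt wording, file names, and random seed are illustrative assumptions, not details taken from the paper.

```python
import itertools
import json
import random

def generate_multiplication_qa(max_digits=2):
    """Exhaustively enumerate multiplication question-answer pairs for all
    operand pairs with up to `max_digits` digits each (digit cap is assumed)."""
    upper = 10 ** max_digits
    for a, b in itertools.product(range(2, upper), repeat=2):
        yield {"prompt": f"Q: What is {a} times {b}? A:", "completion": f" {a * b}"}

def split_80_10_10(examples, seed=0):
    """Hold out 10% for validation and 10% for testing; the rest is for training."""
    random.Random(seed).shuffle(examples)
    n_val = n_test = len(examples) // 10
    val = examples[:n_val]
    test = examples[n_val:n_val + n_test]
    train = examples[n_val + n_test:]
    return train, val, test

if __name__ == "__main__":
    data = list(generate_multiplication_qa(max_digits=2))
    train, val, test = split_80_10_10(data)
    for name, rows in [("train", train), ("val", val), ("test", test)]:
        with open(f"multiplication_{name}.jsonl", "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
```

The prompt/completion JSONL layout is chosen to match the legacy OpenAI fine-tuning format used in the next sketch; a scratchpad variant of the data would place the step-by-step solution in the completion rather than the bare answer.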
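
The Experiment Setup row maps onto the legacy OpenAI fine-tunes endpoint (the openai Python SDK before v1.0), which was the standard route for fine-tuning GPT-3-class models at the time. The sketch below is not the authors' released code: the base model name, file names, and max_tokens value are assumptions, while the epoch count (14 for multiplication question-answer pairs), the batch size of roughly 0.2% of the training examples, the learning rate multiplier of 0.2, and the inference settings (top_p = 0.7, temperature = 1) come from the row above.

```python
import os
import openai  # legacy SDK (openai<1.0), assumed here for the text-davinci-003 era

openai.api_key = os.environ["OPENAI_API_KEY"]

# Upload the prepared JSONL files (file names are illustrative).
train_file = openai.File.create(file=open("multiplication_train.jsonl", "rb"), purpose="fine-tune")
val_file = openai.File.create(file=open("multiplication_val.jsonl", "rb"), purpose="fine-tune")
n_train = sum(1 for _ in open("multiplication_train.jsonl"))

# Fine-tune with the reported hyperparameters: 14 epochs for multiplication
# question-answer pairs, batch size ~0.2% of the training set, and a
# learning rate multiplier of 0.2.
job = openai.FineTune.create(
    training_file=train_file["id"],
    validation_file=val_file["id"],
    model="davinci",  # assumed base model for the legacy endpoint
    n_epochs=14,
    batch_size=max(1, round(0.002 * n_train)),
    learning_rate_multiplier=0.2,
)

# Poll until the job status is "succeeded" (omitted), then fetch the model name.
job = openai.FineTune.retrieve(id=job["id"])
fine_tuned_model = job["fine_tuned_model"]

# Inference with nucleus sampling p = 0.7 and temperature = 1.
response = openai.Completion.create(
    model=fine_tuned_model,
    prompt="Q: What is 4 times 32? A:",
    max_tokens=16,  # assumed; short numeric answers need few tokens
    temperature=1,
    top_p=0.7,
)
print(response["choices"][0]["text"].strip())
```

A batch size of roughly 0.2% of the training set was also the legacy endpoint's documented default, so the paper's setting likely corresponds to leaving that hyperparameter at its default value.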