Limits of Transformer Language Models on Learning to Compose Algorithms

Authors: Jonathan Thomm, Giacomo Camposampiero, Aleksandar Terzic, Michael Hersche, Bernhard Schölkopf, Abbas Rahimi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To this end, we evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks demanding to learn a composition of several discrete sub-tasks. In particular, we measure how well these models can reuse primitives observable in the sub-tasks to learn the composition task. Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient.
Researcher Affiliation | Collaboration | Jonathan Thomm (1,2) jthomm@ethz.ch; Giacomo Camposampiero (1,2) giacomo.camposampiero1@ibm.com; Aleksandar Terzic (1,2) aleksandar.terzic1@ibm.com; Michael Hersche (1) michael.hersche@ibm.com; Bernhard Schölkopf (2,3) bs@tuebingen.mpg.de; Abbas Rahimi (1) abr@zurich.ibm.com. Affiliations: 1 IBM Research Zurich, 2 ETH Zurich, 3 MPI Tübingen.
Pseudocode | Yes |
    def PEN(seq):
        res = [seq[1]]
        while res[-1] != EOS:
            last_match = left(res[-1], seq)
            new_match = match(last_match, seq)
            right_match = right(new_match, seq)
            res.append(right_match)
        return res
(A toy pointer-chasing sketch is given after this table.)
Open Source Code | Yes | We open source our code at https://github.com/IBM/limitations-lm-algorithmic-compositional-learning.
Open Datasets | No | The paper introduces new synthetic tasks (PEN, PERM) and uses tasks from prior work (HSS, MUL) to generate its own datasets. While the data-generation code is open source, the paper does not provide concrete access information (e.g., a direct link, DOI, or citation) for downloading the generated datasets themselves.
Dataset Splits | No | The paper mentions training and testing but does not specify explicit train/validation/test splits with percentages or sample counts, nor does it define a dedicated validation set.
Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, processor types, or memory used for the experiments. It only mentions model size and training parameters.
Software Dependencies | No | The paper mentions using a LLaMA model but does not specify software dependencies with version numbers (e.g., PyTorch, TensorFlow, or specific Python libraries).
Experiment Setup | Yes | We use a batch size of 384 samples, which corresponds to ca. 250K tokens per batch for the PEN task and 50K for the PERM task. Our 150M-parameter model consists of 12 layers and a hidden size of 1024. The learning rate is 10^-4, as in the original paper [2].
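
For reference, the hyperparameters reported in the experiment-setup row could be collected into a small training configuration along the following lines (a minimal Python sketch; the class and field names are illustrative assumptions, not taken from the authors' repository):

    from dataclasses import dataclass

    @dataclass
    class TrainConfig:
        # Values as reported in the paper; field names are assumptions.
        n_layers: int = 12           # 12 Transformer layers
        hidden_size: int = 1024      # hidden dimension of the ~150M-parameter model
        batch_size: int = 384        # ~250K tokens/batch for PEN, ~50K for PERM
        learning_rate: float = 1e-4  # as in the original paper's setup

    config = TrainConfig()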
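
For intuition about the pseudocode row above, here is a self-contained toy pointer-chasing loop in the same spirit. It is not the paper's exact PEN task, whose data format is defined in the authors' repository; the (key, next_key, value) layout and the helper name follow_chain are assumptions made purely for illustration:

    # Toy pointer-chasing loop, loosely analogous to the PEN pseudocode above.
    # The (key, next_key, value) triple layout and helper names are illustrative
    # assumptions, not the authors' data format.
    EOS = "<eos>"

    def follow_chain(seq, start_key):
        # Lookup table over consecutive triples: key -> (next_key, value).
        table = {seq[i]: (seq[i + 1], seq[i + 2]) for i in range(0, len(seq), 3)}
        res = []
        key = start_key
        while True:
            next_key, value = table[key]
            res.append(value)
            if value == EOS:
                break
            key = next_key
        return res

    # Example chain k1 -> k2 -> k3, emitting "a", "b", then EOS.
    toy_seq = ["k1", "k2", "a",
               "k2", "k3", "b",
               "k3", "k1", EOS]
    print(follow_chain(toy_seq, "k1"))  # ['a', 'b', '<eos>']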