Limits of Transformer Language Models on Learning to Compose Algorithms
Authors: Jonathan Thomm, Giacomo Camposampiero, Aleksandar Terzic, Michael Hersche, Bernhard Schölkopf, Abbas Rahimi
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks. In particular, we measure how well these models can reuse primitives observable in the sub-tasks to learn the composition task. Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient. |
| Researcher Affiliation | Collaboration | Jonathan Thomm1,2 jthomm@ethz.ch Giacomo Camposampiero1,2 giacomo.camposampiero1@ibm.com Aleksandar Terzic1,2 aleksandar.terzic1@ibm.com Michael Hersche1 michael.hersche@ibm.com Bernhard Schölkopf2,3 bs@tuebingen.mpg.de Abbas Rahimi1 abr@zurich.ibm.com 1IBM Research Zurich, 2ETH Zurich, 3MPI Tübingen |
| Pseudocode | Yes | def PEN(seq): res = [seq[1]]; while res[-1] != EOS: last_match = left(res[-1], seq); new_match = match(last_match, seq); right_match = right(new_match, seq); res.append(right_match) (see the runnable sketch below the table) |
| Open Source Code | Yes | We open source our code at https://github.com/IBM/limitations-lm-algorithmic-compositional-learning. |
| Open Datasets | No | The paper introduces new synthetic tasks (PEN, PERM) and uses tasks from prior work (HSS, MUL) to generate its own datasets. While it mentions the code for generation is open-source, it does not provide concrete access information (e.g., a direct link, DOI, or citation) to the generated datasets themselves for public download. |
| Dataset Splits | No | The paper mentions training and testing but does not specify explicit train/validation/test splits with percentages or sample counts, nor does it define a dedicated validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, processor types, or memory used for experiments. It only mentions model size and training parameters. |
| Software Dependencies | No | The paper mentions using a 'LLaMA model' but does not specify software dependencies with version numbers (e.g., PyTorch, TensorFlow, or specific Python libraries). |
| Experiment Setup | Yes | We use a batch size of 384 samples, which corresponds to approximately 250K tokens per batch for the PEN task and 50K for the PERM task. Our 150M-parameter model consists of 12 layers with a hidden size of 1024. The learning rate is 10^-4, as in the original paper [2]. (A hedged config sketch follows the table.) |
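
The pseudocode quoted in the Pseudocode row can be made executable. Below is a minimal, hypothetical concretization of its control flow: `left`, `right`, and `match` are placeholder helpers, and the matching rule is replaced by an explicit lookup table purely for illustration, since the paper defines matching through its own word-matching scheme, which may differ.

```python
# Hypothetical, runnable sketch of the PEN loop structure.
# Helper semantics are assumptions chosen to make the example execute;
# they are not taken from the authors' task definition or code.

EOS = "EOS"

def left(tok, seq):
    """Token immediately to the left of `tok` (tokens assumed unique here)."""
    return seq[seq.index(tok) - 1]

def right(tok, seq):
    """Token immediately to the right of `tok`."""
    return seq[seq.index(tok) + 1]

def match(tok, match_table):
    """Placeholder for the task's word-matching rule: a plain lookup table."""
    return match_table[tok]

def pen(seq, match_table):
    res = [seq[1]]                                   # start from the second token
    while res[-1] != EOS:
        last_match = left(res[-1], seq)              # word left of the last output
        new_match = match(last_match, match_table)   # its matching word in the sequence
        right_match = right(new_match, seq)          # that word's right neighbor
        res.append(right_match)
    return res

# Toy sequence of (pointer, payload) pairs ending in EOS.
seq = ["k0", "out0", "k1", "out1", "k2", "out2", "k3", "EOS"]
table = {"k0": "k1", "k1": "k2", "k2": "k3"}
print(pen(seq, table))  # ['out0', 'out1', 'out2', 'EOS']
```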
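
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single config. This is only a sketch; the field names are illustrative and not taken from the authors' released code.

```python
# Hypothetical training config mirroring the reported setup;
# names are illustrative, not from the authors' repository.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    batch_size: int = 384        # ~250K tokens/batch on PEN, ~50K on PERM
    n_layers: int = 12           # 150M-parameter model
    hidden_size: int = 1024
    learning_rate: float = 1e-4  # as in the original paper [2]

print(TrainConfig())
```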