Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Toolformer: Language Models Can Teach Themselves to Use Tools
Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model (Wang and Komatsuzaki, 2021) with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model (Brown et al., 2020) and several other baselines on various tasks. |
| Researcher Affiliation | Collaboration | FAIR, Meta; Universitat Pompeu Fabra |
| Pseudocode | No | The paper contains diagrams illustrating steps (e.g., Figure 2) but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the Toolformer methodology. |
| Open Datasets | Yes | We use a subset of CCNet (Wenzek et al., 2020) as our dataset C and GPT-J (Wang and Komatsuzaki, 2021) as our language model M. ... We evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training. |
| Dataset Splits | Yes | We evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training. |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., specific GPU models, CPU, or memory). |
| Software Dependencies | No | The paper mentions models like GPT-J and NLLB and tools like BM25, but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We finetune M on C using a batch size of 128 and a learning rate of 1 × 10⁻⁵ with linear warmup for the first 10% of training. Finetuning details are given in Appendix B. ... We finetune all models for 100k training steps with a batch size of 128 and a linear learning rate schedule with warmup for the first 10% of training and a maximum learning rate of 1 × 10⁻⁵. |
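
The learning rate schedule quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `linear_warmup_lr` is hypothetical, and the linear decay to zero after warmup is an assumption, since the quoted text only says "linear learning rate schedule" without stating the final value. The step count (100k), warmup fraction (10%), and peak learning rate (1 × 10⁻⁵) are taken from the row above.

```python
def linear_warmup_lr(step, total_steps=100_000, max_lr=1e-5, warmup_frac=0.10):
    """Return the learning rate at a given 0-indexed training step.

    Hypothetical helper: warmup length and peak rate follow the quoted
    setup; the post-warmup decay to zero is an assumption.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr over the first 10% of training.
        return max_lr * step / warmup_steps
    # Assumed: linear decay from max_lr at the end of warmup to 0 at total_steps.
    decay_steps = total_steps - warmup_steps
    return max_lr * max(0.0, (total_steps - step) / decay_steps)


if __name__ == "__main__":
    # Print the rate at a few representative steps.
    for s in (0, 5_000, 10_000, 55_000, 100_000):
        print(s, linear_warmup_lr(s))
```

Under these assumptions the rate peaks at step 10,000 and falls to half of its maximum at step 55,000, the midpoint of the decay phase.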