Toolformer: Language Models Can Teach Themselves to Use Tools

Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model (Wang and Komatsuzaki, 2021) with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model (Brown et al., 2020) and several other baselines on various tasks."
Researcher Affiliation | Collaboration | FAIR, Meta; Universitat Pompeu Fabra
Pseudocode | No | The paper contains diagrams illustrating the key steps of the approach (e.g., Figure 2), but no formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for the Toolformer method.
Open Datasets | Yes | "We use a subset of CCNet (Wenzek et al., 2020) as our dataset C and GPT-J (Wang and Komatsuzaki, 2021) as our language model M. ... We evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training."
Dataset Splits | Yes | "We evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training."
Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., specific GPU models, CPU, or memory).
Software Dependencies | No | The paper mentions models like GPT-J and NLLB and tools like BM25, but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | "We finetune M on C using a batch size of 128 and a learning rate of 1 × 10−5 with linear warmup for the first 10% of training. Finetuning details are given in Appendix B. ... We finetune all models for 100k training steps with a batch size of 128 and a linear learning rate schedule with warmup for the first 10% of training and a maximum learning rate of 1 × 10−5."
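
To make the quoted Experiment Setup concrete, the sketch below wires those numbers (batch size 128, peak learning rate 1 × 10−5, linear warmup over the first 10% of steps, 100k total steps) into a standard finetuning loop. It is a minimal sketch, not the authors' implementation: the optimizer choice (AdamW), the Hugging Face `get_linear_schedule_with_warmup` helper, the `EleutherAI/gpt-j-6b` checkpoint id, and the `train_loader` over the API-augmented CCNet subset are assumptions introduced for illustration and are not specified in the paper.

```python
# Minimal sketch of the finetuning schedule quoted in the Experiment Setup row.
# Assumptions (not stated in the quoted text): AdamW optimizer, the Hugging Face
# `transformers` scheduler helper, the public GPT-J checkpoint id, and a
# hypothetical `train_loader` yielding tokenized batches (input_ids + labels).
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

TOTAL_STEPS = 100_000                # "100k training steps"
BATCH_SIZE = 128                     # "batch size of 128" (consumed by the data loader)
PEAK_LR = 1e-5                       # "maximum learning rate of 1 × 10−5"
WARMUP_STEPS = TOTAL_STEPS // 10     # "linear warmup for the first 10% of training"

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")  # GPT-J; hub id is an assumption
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)        # optimizer choice is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

train_loader = ...  # hypothetical DataLoader (batch size 128) over the API-augmented CCNet subset

model.train()
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss       # standard causal language-modeling loss
    loss.backward()
    optimizer.step()
    scheduler.step()                 # warmup for the first 10k steps, then linear decay
    optimizer.zero_grad()
    if step + 1 >= TOTAL_STEPS:
        break
```

With these values the learning rate rises linearly from 0 to 1 × 10−5 over the first 10,000 steps and then decays linearly back to 0 by step 100,000.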