Toolformer: Language Models Can Teach Themselves to Use Tools
Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model (Wang and Komatsuzaki, 2021) with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model (Brown et al., 2020) and several other baselines on various tasks. |
| Researcher Affiliation | Collaboration | FAIR, Meta; Universitat Pompeu Fabra |
| Pseudocode | No | The paper contains diagrams illustrating steps (e.g., Figure 2) but no formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the Toolformer methodology. |
| Open Datasets | Yes | We use a subset of CCNet (Wenzek et al., 2020) as our dataset C and GPT-J (Wang and Komatsuzaki, 2021) as our language model M. ... We evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training. |
| Dataset Splits | Yes | We evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training. (A held-out split sketch follows the table.) |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., specific GPU models, CPU, or memory). |
| Software Dependencies | No | The paper mentions models like GPT-J and NLLB and tools like BM25, but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We finetune M on C using a batch size of 128 and a learning rate of 1 × 10^-5 with linear warmup for the first 10% of training. Finetuning details are given in Appendix B. ... We finetune all models for 100k training steps with a batch size of 128 and a linear learning rate schedule with warmup for the first 10% of training and a maximum learning rate of 1 × 10^-5. (A schedule sketch follows the table.) |
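The paper describes the evaluation split only in prose (10,000 randomly selected CCNet documents held out from training) and releases no splitting code. The following is a minimal sketch under that description; `ccnet_documents` is a hypothetical list of document strings, and the seeded shuffle is an assumption made for reproducibility of the illustration.

```python
import random

def split_ccnet(ccnet_documents, num_eval=10_000, seed=0):
    """Hold out `num_eval` randomly selected documents for language-modeling
    evaluation; the remaining documents stay available for training.
    `ccnet_documents` is a hypothetical list of document strings (the paper
    provides no splitting code)."""
    rng = random.Random(seed)
    indices = list(range(len(ccnet_documents)))
    rng.shuffle(indices)
    eval_docs = [ccnet_documents[i] for i in indices[:num_eval]]
    train_docs = [ccnet_documents[i] for i in indices[num_eval:]]
    return train_docs, eval_docs
```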
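The finetuning schedule quoted above (100k steps, batch size 128, linear warmup over the first 10% of training, maximum learning rate 1 × 10^-5) can be sketched as a step-to-learning-rate function. The linear decay back to zero after warmup is an assumption; the paper states only "a linear learning rate schedule with warmup".

```python
def learning_rate(step, total_steps=100_000, warmup_frac=0.10, max_lr=1e-5):
    """Learning rate at a given optimizer step: linear warmup over the first
    10% of training up to max_lr, then (assumed) linear decay to zero over
    the remaining steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)
```

In practice the same warmup-then-linear-decay shape is what `transformers.get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)` produces, though the paper does not say which implementation was used.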