Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning to Reason and Memorize with Self-Notes
Authors: Jack Lanchantin, Shubham Toshniwal, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across a wide variety of tasks demonstrate that our method can outperform chain-of-thought and scratchpad methods by taking Self-Notes that interleave the input text. |
| Researcher Affiliation | Industry | Jack Lanchantin (Meta AI), Shubham Toshniwal (NVIDIA), Jason Weston (Meta AI), Arthur Szlam (Meta AI), Sainbayar Sukhbaatar (Meta AI) |
| Pseudocode | No | The paper describes its methods verbally and with examples but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Reproducibility statement: We will make code and data publicly available. |
| Open Datasets | Yes | We test our method on seven text datasets designed to evaluate multi-step reasoning and state-tracking: a proposed synthetic Toy-Story task, two synthetic program evaluation tasks [11, 16], two real-world chess game tasks [17], and two math word problem tasks previously used to test chain-of-thought prompting, MultiArith and GSM8K [18, 19]. |
| Dataset Splits | Yes | Table 8: Dataset Statistics. # train # valid # test In domain Out-of domain |
| Hardware Specification | Yes | We fine-tune all of the GPT-2 models on 8 NVIDIA V100 GPUs using an on-site cluster. |
| Software Dependencies | Yes | The GSM8K experiments were done using the text-davinci-003 model with the OpenAI API. |
| Experiment Setup | Yes | For each non-prompting task, we train for a fixed 30 epochs with a learning rate of 2e-5 and batch size of 32. |
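
The hyperparameters quoted in the Experiment Setup and Hardware Specification rows can be collected into a small configuration object. The following is only an illustrative sketch, not the authors' code: the names `FineTuneConfig` and `total_optimizer_steps` are hypothetical, the batch size of 32 is assumed to be the global batch size (the quoted text does not say whether it is per GPU), and the training-set size in the usage line is made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the table for the non-prompting tasks (hypothetical container)."""
    epochs: int = 30             # fixed number of training epochs
    learning_rate: float = 2e-5
    batch_size: int = 32         # ASSUMPTION: treated as the global batch size
    num_gpus: int = 8            # NVIDIA V100 GPUs, per the Hardware Specification row


def total_optimizer_steps(num_train_examples: int, cfg: FineTuneConfig) -> int:
    """Steps in a full run, assuming one optimizer step per batch and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_train_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = FineTuneConfig()
# 10_000 is a made-up training-set size, used for illustration only.
print(total_optimizer_steps(10_000, cfg))
```

A frozen dataclass keeps the reported values in one immutable place, so a reproduction attempt can log exactly which settings were used.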