Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning to Reason and Memorize with Self-Notes

Authors: Jack Lanchantin, Shubham Toshniwal, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Experiments across a wide variety of tasks demonstrate that our method can outperform chain-of-thought and scratchpad methods by taking Self-Notes that interleave the input text." |
| Researcher Affiliation | Industry | Jack Lanchantin (Meta AI), Shubham Toshniwal (NVIDIA), Jason Weston (Meta AI), Arthur Szlam (Meta AI), Sainbayar Sukhbaatar (Meta AI) |
| Pseudocode | No | The paper describes its methods verbally and with examples but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Reproducibility statement: "We will make code and data publicly available." |
| Open Datasets | Yes | "We test our method on seven text datasets designed to evaluate multi-step reasoning and state-tracking: a proposed synthetic Toy-Story task, two synthetic program evaluation tasks [11, 16], two real-world chess game tasks [17], and two math word problem tasks previously used to test chain-of-thought prompting, MultiArith and GSM8K [18, 19]." |
| Dataset Splits | Yes | Table 8 ("Dataset Statistics") reports # train, # valid, and # test counts for in-domain and out-of-domain splits. |
| Hardware Specification | Yes | "We fine-tune all of the GPT-2 models on 8 NVIDIA V100 GPUs using an on-site cluster." |
| Software Dependencies | Yes | "The GSM8K experiments were done using the text-davinci-003 model with the OpenAI API." |
| Experiment Setup | Yes | "For each non-prompting task, we train for a fixed 30 epochs with a learning rate of 2e-5 and batch size of 32." |
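For readers attempting to reproduce the reported fine-tuning setup, the stated hyperparameters (a fixed 30 epochs, learning rate 2e-5, batch size 32, GPT-2 models) can be captured in a minimal configuration sketch. The class and function names below are illustrative assumptions; the paper does not publish a configuration file.

```python
# Hypothetical config sketch based only on the hyperparameters quoted
# above; field and helper names are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    epochs: int = 30            # "a fixed 30 epochs" per non-prompting task
    learning_rate: float = 2e-5
    batch_size: int = 32
    model_name: str = "gpt2"    # the paper fine-tunes GPT-2 models


def total_steps(cfg: FinetuneConfig, num_examples: int) -> int:
    """Optimizer steps for one run, assuming no gradient accumulation."""
    steps_per_epoch = -(-num_examples // cfg.batch_size)  # ceiling division
    return steps_per_epoch * cfg.epochs


cfg = FinetuneConfig()
# e.g. a 1,000-example task: ceil(1000/32) = 32 batches/epoch, 32 * 30 = 960
print(total_steps(cfg, num_examples=1000))
```

A sketch like this makes the training budget explicit, but exact step counts for the paper's tasks depend on the dataset sizes in its Table 8.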