Learning to Reason via Program Generation, Emulation, and Search

Authors: Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate over a diverse suite of reasoning tasks, including commonsense QA, text classification, and math datasets. These datasets cover symbolic tasks that could conceptually benefit from programmatic operations and standard natural language tasks whose solutions might not be easily described in code. We find that applying COTACS leads the COGEX models to substantially outperform the comparable NL-based LM using the same original checkpoint and the same set of training examples available for in-context learning, even in the few-shot regime. COTACS thus gives us one way to fit a model to a new dataset without having to perform any gradient descent or parameter updates, both for algorithmic and softer reasoning datasets.
Researcher Affiliation | Collaboration | Nathaniel Weir (Johns Hopkins University, nweir@jhu.edu); Muhammad Khalifa (University of Michigan, khalifam@umich.edu); Linlu Qiu (MIT, linluqiu@mit.edu); Orion Weller (Johns Hopkins University, oweller@cs.jhu.edu); Peter Clark (Allen Institute for AI, peterc@allenai.org)
Pseudocode | Yes | Algorithm 1: COTACS search that identifies a set of k programs P_D that best adapts a COGEX model to a new dataset D (a rough sketch of this search appears below the table).
Open Source Code | Yes | Our released dataset, fine-tuned models, and implementation can be found at https://github.com/nweir127/CoGEX.
Open Datasets | Yes | We train COGEX models by adapting the recent Alpaca instruction-tuning dataset (Taori et al., 2023) into a set of analogous Pythonic examples by prompting GPT-4 to perform the conversion, and then use the resulting COGEX dataset to fine-tune smaller (7B and 13B) LMs to answer instructions via code.
Dataset Splits | Yes | As a validation set, we randomly sample 2K examples from the training set and keep the checkpoint with the lowest perplexity on the validation set for testing.
Hardware Specification | Yes | Model training was done on a server with 128GB of RAM and 2 Nvidia A6000 48GB GPUs.
Software Dependencies | No | We use parameter-efficient training via low-rank adaptation (LoRA) (Hu et al., 2021) with a rank of r = 16, dropout rate of 0.05, and LoRA weights added to the query, key, value, and output matrices in all layers.
Experiment Setup | Yes | We use parameter-efficient training via low-rank adaptation (LoRA) (Hu et al., 2021) with a rank of r = 16, dropout rate of 0.05, and LoRA weights added to the query, key, value, and output matrices in all layers. We train all models for five epochs using a batch size of 32 and a learning rate of 0.0003 (see the configuration sketch below the table).
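
To make the Pseudocode row concrete: Algorithm 1 (COTACS) searches for a small set of k programs that adapt a trained COGEX model to a new dataset. The minimal sketch below illustrates the general shape of such a search under stated assumptions; `generate_program` and `emulate` are hypothetical placeholders for the model's program-generation and program-emulation calls, and all parameter names and defaults are illustrative rather than the authors' implementation (see the released repository for the actual code).

```python
import random
from collections import Counter


def generate_program(model, example):
    """Placeholder: prompt the COGEX model to write a program for `example`."""
    raise NotImplementedError


def emulate(model, program, task_input):
    """Placeholder: have the COGEX model emulate running `program` on `task_input`."""
    raise NotImplementedError


def cotacs_search(model, train_set, k=1, n_candidates=20, n_eval=200, seed=0):
    """Sketch of a COTACS-style search: generate candidate programs from
    training examples, score each by emulated execution on a subset of the
    training data, and keep the top-k programs for the dataset."""
    rng = random.Random(seed)

    # Generate one candidate program per sampled seed example.
    seeds = rng.sample(train_set, min(n_candidates, len(train_set)))
    candidates = [generate_program(model, ex) for ex in seeds]

    # Score every candidate on a fixed evaluation subset of the training data.
    eval_subset = rng.sample(train_set, min(n_eval, len(train_set)))
    scored = []
    for program in candidates:
        correct = sum(
            emulate(model, program, ex["input"]) == ex["answer"]
            for ex in eval_subset
        )
        scored.append((correct / len(eval_subset), program))

    # Keep the k programs with the highest estimated accuracy.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [program for _, program in scored[:k]]


def predict(model, programs, test_input):
    """At test time, emulate each retained program and take a majority vote."""
    votes = Counter(emulate(model, p, test_input) for p in programs)
    return votes.most_common(1)[0][0]
```

This reflects the adaptation-without-gradient-descent point quoted in the Research Type row: only inference calls are needed to fit the retained programs to a new dataset.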
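
The Experiment Setup and Dataset Splits rows together specify the fine-tuning recipe: LoRA with r = 16, dropout 0.05, adapters on the query/key/value/output projections, five epochs, batch size 32, learning rate 3e-4, and checkpoint selection by lowest validation perplexity. A minimal configuration sketch using the Hugging Face `peft` and `transformers` libraries is below; the base checkpoint, `lora_alpha`, projection-module names, and per-device batch split are assumptions not stated in the excerpt, and tokenization, data collation, and the training call itself are omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base checkpoint is an assumption; the paper fine-tunes 7B and 13B LMs.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA settings from the excerpt: rank 16, dropout 0.05, adapters on the
# query/key/value/output projections. `lora_alpha` is an assumed value.
lora = LoraConfig(
    r=16,
    lora_alpha=32,                       # not reported in the excerpt
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Optimization settings from the excerpt: 5 epochs, batch size 32, lr 3e-4.
# Keeping the checkpoint with the lowest validation loss matches the paper's
# "lowest perplexity on the validation set" criterion, since perplexity is
# monotone in the evaluation loss.
args = TrainingArguments(
    output_dir="cogex-lora",
    num_train_epochs=5,
    per_device_train_batch_size=16,      # 2 GPUs x 16 = effective batch of 32
    learning_rate=3e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```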