TART: A plug-and-play Transformer module for task-agnostic reasoning

Authors: Kush Bhatia, Avanika Narayan, Christopher M. De Sa, Christopher Ré

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks.
Researcher Affiliation | Academia | {kushb, avanika, chrismre}@cs.stanford.edu, cdesa@cs.cornell.edu
Pseudocode | Yes | Figure 2: TART. (Left) Inference module training procedure: The inference module is trained on sequences of synthetically generated logistic regression tasks. (Right) End-to-end framework: TART composes a pre-trained LLM with the inference module.
Open Source Code | Yes | Our code and model is available at https://github.com/HazyResearch/TART
Open Datasets | Yes | The evaluation datasets include: SST [35], Rotten Tomatoes [29], SMS Spam [3], IMDB [27], Civil Comments [8], AGNews [48], DBPedia [48], and the Youtube dataset [46].
Dataset Splits | No | The paper describes sampling training sets and performing hyperparameter searches, but does not explicitly specify a separate validation split (e.g., percentages or counts) used for this purpose.
Hardware Specification | Yes | We use a single NVIDIA RTX A6000 GPU ($2/hr) for 6 hours to train our TART inference head, costing a total of $18 to train. We use 4 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 100 hours for hyperparameter tuning, costing a total of $1,400. We use 8 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 50 hours to fine-tune all task-adaptation baseline models.
Software Dependencies | No | The paper mentions 'scikit-learn python library' but does not specify a version number for it or other software dependencies.
Experiment Setup | Yes | For each baseline, we perform an extensive hyper-parameter search over number of epochs and learning rate for each dataset in order to optimize performance. We search over a range of learning rates (1e-3, 1e-4, 3e-5, 1e-5, 8e-6), and range of epochs (5, 10, 15, 20, 50). For all models < 1B parameters, we use a batch size of 1. For all models > 1B parameters, we use a batch size of 8.
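
The Pseudocode row above describes training the inference module on sequences of synthetically generated logistic regression tasks. The sketch below is only a rough illustration of what such synthetic training sequences could look like; the dimensions, noise model, and interleaved packing are assumptions for illustration, not details taken from the paper.

    import numpy as np

    def sample_logistic_regression_task(n_points=64, dim=16, rng=None):
        # Draw a random ground-truth weight vector and noisy binary labels.
        rng = rng or np.random.default_rng()
        w = rng.normal(size=dim)
        x = rng.normal(size=(n_points, dim))
        probs = 1.0 / (1.0 + np.exp(-(x @ w)))   # sigmoid of the logits
        y = rng.binomial(1, probs)
        return x, y

    def build_training_sequence(n_points=64, dim=16, rng=None):
        # Pack (x_1, y_1, ..., x_k, y_k) into one interleaved sequence, the
        # kind of input an in-context inference module could be trained on.
        x, y = sample_logistic_regression_task(n_points, dim, rng)
        seq = np.zeros((2 * n_points, dim))
        seq[0::2] = x          # even positions hold the inputs
        seq[1::2, 0] = y       # odd positions encode the label
        return seq

    print(build_training_sequence(rng=np.random.default_rng(0)).shape)  # (128, 16)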
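
All of the evaluation corpora listed in the Open Datasets row are publicly available. A minimal sketch of loading several of them with the Hugging Face datasets library follows; the hub identifiers are assumed mirrors of these corpora and are not taken from the paper, and the YouTube spam dataset is omitted because a standard hub identifier is less clear.

    from datasets import load_dataset

    # Hub identifiers are assumed mirrors of the corpora listed above,
    # not names taken from the paper itself.
    CANDIDATE_IDS = {
        "SST": "sst2",
        "Rotten Tomatoes": "rotten_tomatoes",
        "SMS Spam": "sms_spam",
        "IMDB": "imdb",
        "Civil Comments": "civil_comments",
        "AGNews": "ag_news",
        "DBPedia": "dbpedia_14",
    }

    for name, hub_id in CANDIDATE_IDS.items():
        ds = load_dataset(hub_id, split="train")
        print(f"{name}: {len(ds)} training examples")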
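
The Experiment Setup row fixes the baseline search space. The sketch below simply enumerates that grid; the grid values and batch-size rule come from the quoted setup, while the function names and configuration dictionary are illustrative.

    from itertools import product

    LEARNING_RATES = [1e-3, 1e-4, 3e-5, 1e-5, 8e-6]
    EPOCHS = [5, 10, 15, 20, 50]

    def batch_size_for(num_params):
        # Batch-size rule from the quoted setup: 1 below 1B parameters, 8 above.
        return 1 if num_params < 1_000_000_000 else 8

    def sweep_configs(num_params):
        # Enumerate every (learning rate, epoch count) combination for a dataset.
        bs = batch_size_for(num_params)
        for lr, epochs in product(LEARNING_RATES, EPOCHS):
            yield {"learning_rate": lr, "epochs": epochs, "batch_size": bs}

    print(sum(1 for _ in sweep_configs(num_params=355_000_000)))  # 25 configurations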