reproducibilityindex.ai

TART: A plug-and-play Transformer module for task-agnostic reasoning

Authors: Kush Bhatia, Avanika Narayan, Christopher M. De Sa, Christopher Ré

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks.
Researcher Affiliation	Academia	{kushb, avanika, chrismre}@cs.stanford.edu, cdesa@cs.cornell.edu
Pseudocode	Yes	Figure 2: TART. (Left) Inference module training procedure: The inference module is trained on sequences of synthetically generated logistic regression tasks. (Right) End-to-end framework: TART composes a pre-trained LLM with the inference module.
Open Source Code	Yes	Our code and model is available at https://github.com/HazyResearch/TART
Open Datasets	Yes	The evaluation datasets include: SST [35], Rotten Tomatoes [29], SMS Spam [3], IMDB [27], Civil Comments [8], AGNews [48], DBPedia [48], and the Youtube dataset [46].
Dataset Splits	No	The paper describes sampling training sets and performing hyperparameter searches, but does not explicitly define or specify the details of a separate validation split (e.g., percentages or counts) used for this purpose.
Hardware Specification	Yes	We use a single NVIDIA RTX A6000 GPU ($2/hr) for 6 hours to train our TART inference head, costing a total of $18 to train. We use 4 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 100 hours for hyperparameter tuning, costing a total of $1,400. We use 8 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 50 hours to fine-tune all task-adaptation baseline models.
Software Dependencies	No	The paper mentions 'scikit-learn python library' but does not specify a version number for it or other software dependencies.
Experiment Setup	Yes	For each baseline, we perform an extensive hyper-parameter search over number of epochs and learning rate for each dataset in order to optimize performance. We search over a range of learning rates (1e-3, 1e-4, 3e-5, 1e-5, 8e-6), and range of epochs (5, 10, 15, 20, 50). For all models < 1B parameters, we use a batch size of 1. For all models > 1B parameters, we use a batch size of 8.
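
The search space quoted in the Experiment Setup row can be enumerated with a short sketch. The learning rates, epoch counts, and batch-size rule come from the quoted text; the function names, the dictionary keys, and the handling of a model with exactly 1B parameters are assumptions made for illustration and are not taken from the paper or its repository.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-3, 1e-4, 3e-5, 1e-5, 8e-6]
EPOCHS = [5, 10, 15, 20, 50]

ONE_BILLION = 1_000_000_000


def batch_size_for(num_params: int) -> int:
    """Return the batch size for a model with the given parameter count.

    The quoted setup gives 1 for models under 1B parameters and 8 for
    models over 1B. It does not cover exactly 1B; this sketch assumes
    that case falls in the larger group.
    """
    return 1 if num_params < ONE_BILLION else 8


def build_grid(num_params: int) -> list[dict]:
    """List every (learning rate, epochs) pair for one model and dataset."""
    batch_size = batch_size_for(num_params)
    return [
        {"learning_rate": lr, "epochs": epochs, "batch_size": batch_size}
        for lr, epochs in product(LEARNING_RATES, EPOCHS)
    ]


if __name__ == "__main__":
    # Hypothetical 125M-parameter model, used only to show the grid size.
    grid = build_grid(125_000_000)
    print(len(grid), "configurations per dataset")
```

Five learning rates and five epoch settings give 25 configurations per baseline and dataset, assuming the search covered the full cross product, which the quoted text does not state explicitly.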