TART: A plug-and-play Transformer module for task-agnostic reasoning
Authors: Kush Bhatia, Avanika Narayan, Christopher M. De Sa, Christopher Ré
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. |
| Researcher Affiliation | Academia | {kushb, avanika, chrismre}@cs.stanford.edu, cdesa@cs.cornell.edu |
| Pseudocode | Yes | Figure 2: TART. (Left) Inference module training procedure: The inference module is trained on sequences of synthetically generated logistic regression tasks. (Right) End-to-end framework: TART composes a pre-trained LLM with the inference module. (A sketch of this setup appears below the table.) |
| Open Source Code | Yes | Our code and model are available at https://github.com/HazyResearch/TART |
| Open Datasets | Yes | The evaluation datasets include: SST [35], Rotten Tomatoes [29], SMS Spam [3], IMDB [27], Civil Comments [8], AGNews [48], DBPedia [48], and the Youtube dataset [46]. |
| Dataset Splits | No | The paper describes sampling training sets and performing hyperparameter searches, but does not explicitly define or specify the details of a separate validation split (e.g., percentages or counts) used for this purpose. |
| Hardware Specification | Yes | We use a single NVIDIA RTX A6000 GPU ($2/hr) for 6 hours to train our TART inference head, costing a total of $18 to train. We use 4 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 100 hours for hyperparameter tuning, costing a total of $1,400. We use 8 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 50 hours to fine-tune all task-adaptation baseline models. |
| Software Dependencies | No | The paper mentions 'scikit-learn python library' but does not specify a version number for it or other software dependencies. |
| Experiment Setup | Yes | For each baseline, we perform an extensive hyper-parameter search over number of epochs and learning rate for each dataset in order to optimize performance. We search over a range of learning rates (1e-3, 1e-4, 3e-5, 1e-5, 8e-6), and range of epochs (5, 10, 15, 20, 50). For all models < 1B parameters, we use a batch size of 1. For all models > 1B parameters, we use a batch size of 8. (A sketch of this search loop appears below the table.) |
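The Pseudocode row quotes the Figure 2 caption rather than literal pseudocode, so the following minimal Python sketch illustrates the idea that caption describes: the inference module is trained only on synthetically generated logistic regression tasks, and at inference time it is composed with representations from a frozen LLM. The names below (`sample_logistic_task`, `k`, `d`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of Figure 2 (left): TART's inference module never sees
# downstream text during training, only synthetic logistic regression tasks.
import numpy as np

def sample_logistic_task(k: int = 64, d: int = 16, rng=np.random.default_rng()):
    """Sample one in-context sequence: k (x, y) pairs from a random logistic model."""
    w = rng.normal(size=d)                     # latent task vector
    X = rng.normal(size=(k, d))                # covariates
    p = 1.0 / (1.0 + np.exp(-X @ w))           # Bernoulli class probabilities
    y = (rng.uniform(size=k) < p).astype(np.float32)
    return X.astype(np.float32), y

# A Transformer-based inference module would be trained to predict y_t from the
# prefix (x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t) across many such tasks. In the
# end-to-end framework (Figure 2, right), the x_i are replaced by embeddings of
# the task's text examples from a frozen pre-trained LLM, so the LLM supplies
# representations and the module supplies the probabilistic reasoning.
X, y = sample_logistic_task()
print(X.shape, y.shape)   # (64, 16) (64,)
```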
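The Experiment Setup row describes a grid search over learning rates and epochs with a batch size chosen by model size. The sketch below reconstructs that loop under stated assumptions: `train_and_eval` is a hypothetical stand-in for each baseline's fine-tuning and evaluation routine, and only the two grids and the batch-size rule come from the quoted text.

```python
# Hedged sketch of the per-dataset hyperparameter search described above.
from itertools import product

LEARNING_RATES = [1e-3, 1e-4, 3e-5, 1e-5, 8e-6]
EPOCHS = [5, 10, 15, 20, 50]

def batch_size(num_params: int) -> int:
    # Quoted rule: batch size 1 for models under 1B parameters, 8 otherwise.
    return 1 if num_params < 1_000_000_000 else 8

def grid_search(dataset, num_params, train_and_eval):
    """Return the (lr, epochs) pair with the best score from train_and_eval."""
    best_config, best_score = None, float("-inf")
    for lr, n_epochs in product(LEARNING_RATES, EPOCHS):
        score = train_and_eval(dataset, lr=lr, epochs=n_epochs,
                               batch_size=batch_size(num_params))
        if score > best_score:
            best_config, best_score = (lr, n_epochs), score
    return best_config, best_score
```

As a usage example, `grid_search("sst2", 355_000_000, train_and_eval)` would sweep all 25 (learning rate, epoch) combinations with batch size 1 for a 355M-parameter baseline.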