Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
TART: A plug-and-play Transformer module for task-agnostic reasoning
Authors: Kush Bhatia, Avanika Narayan, Christopher M. De Sa, Christopher Ré
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. |
| Researcher Affiliation | Academia | EMAIL, EMAIL |
| Pseudocode | Yes | Figure 2: TART. (Left) Inference module training procedure: The inference module is trained on sequences of synthetically generated logistic regression tasks. (Right) End-to-end framework: TART composes a pre-trained LLM with the inference module. |
| Open Source Code | Yes | Our code and model is available at https://github.com/HazyResearch/TART |
| Open Datasets | Yes | The evaluation datasets include: SST [35], Rotten Tomatoes [29], SMS Spam [3], IMDB [27], Civil Comments [8], AGNews [48], DBPedia [48], and the Youtube dataset [46]. |
| Dataset Splits | No | The paper describes sampling training sets and performing hyperparameter searches, but does not explicitly define or specify the details of a separate validation split (e.g., percentages or counts) used for this purpose. |
| Hardware Specification | Yes | We use a single NVIDIA RTX A6000 GPU ($2/hr) for 6 hours to train our TART inference head, costing a total of $18 to train. We use 4 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 100 hours for hyperparameter tuning, costing a total of $1,400. We use 8 NVIDIA A100 GPUs ($3.5 per GPU/hr) for 50 hours to fine-tune all task-adaptation baseline models. |
| Software Dependencies | No | The paper mentions 'scikit-learn python library' but does not specify a version number for it or other software dependencies. |
| Experiment Setup | Yes | For each baseline, we perform an extensive hyper-parameter search over number of epochs and learning rate for each dataset in order to optimize performance. We search over a range of learning rates (1e-3, 1e-4, 3e-5, 1e-5, 8e-6), and range of epochs (5, 10, 15, 20, 50). For all models < 1B parameters, we use a batch size of 1. For all models > 1B parameters, we use a batch size of 8. |
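
The search space quoted in the Experiment Setup row can be written out as a small sketch. This is only an illustration of the reported values, not the authors' tuning code. It assumes the two ranges are combined as a full Cartesian grid, which the quoted text does not state. The names `search_grid` and `batch_size_for` are hypothetical, and the quoted rule does not say which batch size applies at exactly 1B parameters, so the sketch arbitrarily groups that case with the larger models.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-3, 1e-4, 3e-5, 1e-5, 8e-6]
EPOCHS = [5, 10, 15, 20, 50]

ONE_BILLION = 1_000_000_000


def batch_size_for(num_params: int) -> int:
    """Quoted rule: batch size 1 for models < 1B parameters, 8 for models > 1B.

    Assumption: a model with exactly 1B parameters is treated like the
    larger models; the source does not cover that case.
    """
    return 1 if num_params < ONE_BILLION else 8


def search_grid(num_params: int) -> list[dict]:
    """Enumerate every (learning rate, epochs) pair for one model size.

    Assumption: the search is a full Cartesian product of the two ranges.
    """
    batch_size = batch_size_for(num_params)
    return [
        {"learning_rate": lr, "epochs": epochs, "batch_size": batch_size}
        for lr, epochs in product(LEARNING_RATES, EPOCHS)
    ]


if __name__ == "__main__":
    # Hypothetical model size, used only to show the call.
    grid = search_grid(125_000_000)
    print(len(grid), "configurations; first:", grid[0])
```

Under the full-grid assumption, each baseline and dataset pair has 25 candidate configurations.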