Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Authors: Alon Albalak, Colin A. Raffel, William Yang Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimentation we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3. |
| Researcher Affiliation | Academia | Alon Albalak (University of California, Santa Barbara); Colin Raffel (University of Toronto, Vector Institute); William Yang Wang (University of California, Santa Barbara) |
| Pseudocode | Yes | We include here pseudo-code for our 2 proposed algorithms. Algorithm 1 contains the pseudo-code for EXP3-FLAD, and Algorithm 2 contains the pseudo-code for UCB1-FLAD. |
| Open Source Code | Yes | All of our code is available at github.com/alon-albalak/FLAD. |
| Open Datasets | Yes | We obtain all datasets from Hugging Face Datasets, and cast them to the text-to-text format by applying prompt templates from the Public Pool of Prompts (P3) [23] that was used to train T0. |
| Dataset Splits | Yes | For each dataset, we randomly sample five few-shot splits from their training data, containing the same number of training examples as previous works, between 20 and 70 [55, 56]. We further divide each split into equal training and validation partitions for true few-shot learning [57] (e.g., 10 train and 10 validation samples for HellaSwag). |
| Hardware Specification | Yes | We train all models (FLAD and non-FLAD) on 40GB A100s. |
| Software Dependencies | No | We used model checkpoints from Hugging Face Transformers [45]. For all experiments we use the Adafactor optimizer [58]. |
| Experiment Setup | Yes | For the target-only baseline, we use learning rates in {1e-4, 3e-4}. For all other methods, we always use a learning rate of 1e-4. For target-, explore-, and exploit-only baselines we use batch sizes in {32, 128}. For loss-scaling, EXP3-FLAD, and UCB1-FLAD we use mini-batches of 8 samples and let G be in {4, 16} to match the batch size of all methods. For explore- and exploit-only, we use a target dataset mixing ratio of M ∈ {1, 5, 10}. For all experiments we use the Adafactor optimizer [58] and validation-based early stopping for model checkpoint selection. |
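
The Dataset Splits row describes sampling five few-shot splits per dataset and dividing each one equally into training and validation partitions. The following is a minimal sketch of that procedure, not the authors' code: the function name `make_few_shot_splits`, the seed, the integer stand-in data, and the choice to sample each split independently are all assumptions made for illustration.

```python
import random


def make_few_shot_splits(train_examples, shots, num_splits=5, seed=0):
    """Sample `num_splits` few-shot splits and halve each into train/validation.

    Hypothetical helper: the name, the seed handling, and independent sampling
    per split are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    half = shots // 2
    splits = []
    for _ in range(num_splits):
        # Draw `shots` distinct examples from the full training data.
        sample = rng.sample(train_examples, shots)
        # Divide the sample into equal training and validation partitions.
        splits.append({"train": sample[:half], "validation": sample[half:]})
    return splits


# Stand-in training data; a real run would use the dataset's training examples.
pool = list(range(1000))
splits = make_few_shot_splits(pool, shots=20)
```

With `shots=20`, each split holds 10 training and 10 validation examples, matching the HellaSwag example quoted in the table.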
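
The Experiment Setup row says mini-batches of 8 samples with G in {4, 16} are used "to match the batch size of all methods". The row does not define G; the sketch below assumes G is the number of mini-batches accumulated per model update, in which case the arithmetic lines up with the baseline batch sizes of 32 and 128. All names in the code are illustrative.

```python
# Values quoted in the Experiment Setup row.
MINI_BATCH_SIZE = 8
G_VALUES = (4, 16)
BASELINE_BATCH_SIZES = (32, 128)


def effective_batch_size(mini_batch_size, g):
    """Samples contributing to one update, assuming G mini-batches are accumulated.

    The interpretation of G as an accumulation count is an assumption;
    the table row does not define it.
    """
    return mini_batch_size * g


effective = tuple(effective_batch_size(MINI_BATCH_SIZE, g) for g in G_VALUES)
matches_baselines = effective == BASELINE_BATCH_SIZES
```

Under that assumption, 8 × 4 = 32 and 8 × 16 = 128, so every method sees the same number of samples per update.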