UL2: Unifying Language Learning Paradigms

Authors: Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups.
Researcher Affiliation | Industry | Google Research, Brain Team; {yitay,dehghani}@google.com
Pseudocode | Yes | C.4 IMPLEMENTATION DETAILS AND UL2 CODE: This section aims to give more insight into how UL2 pretraining is implemented. Our implementation is actually pretty simple. It is simply a mixture of different pretraining objectives that is implemented in seqio. Most of our experiments were run by simply mixing different seqio tasks with seqio's Mixture Registry. However, one could also implement a generalized UL2 objective with the following function, which could be cleaner: def ul2_objective(dataset: tf.data.Dataset, sequence_length: seqio.preprocessors.SequenceLengthType, output_features: seqio.preprocessors.OutputFeaturesType, use_prefix_lm_task: bool = False, rates: Optional[Sequence[float]] = None, mean_noise_span_lengths: Sequence[float] = (3.0,), noise_densities: Sequence[float] = (0.15,), shard_ds: bool = True, optional_task_prefixes: Optional[Sequence[str]] = None, input_feature_key: str = "inputs", merge_examples_to_reduce_padding: bool = True, reserved_for_packing: bool = None, seed: int = 7) -> tf.data.Dataset: (see the mixture sketch below the table)
Open Source Code | Yes | "We publicly release Flax-based T5X model checkpoints for the 20B model." and "Pretrained checkpoints will be released at https://github.com/anonymous."
Open Datasets | Yes | The datasets we use are SuperGLUE (Wang et al., 2019), comprising of 8 NLU sub-tasks. We also conduct experiments on 3 datasets from the GEM benchmark (Gehrmann et al., 2021) that focuses on language generation problems. We arbitrarily select XSUM (summarization), ToTTo (table-to-text generation) (Parikh et al., 2020) and Schema Guided Dialog (SGD) (Rastogi et al., 2019) from the GEM benchmark.
Dataset Splits | Yes | The datasets we use are SuperGLUE (Wang et al., 2019), comprising of 8 NLU sub-tasks. We also conduct experiments on 3 datasets from the GEM benchmark (Gehrmann et al., 2021) that focuses on language generation problems. We arbitrarily select XSUM (summarization), ToTTo (table-to-text generation) (Parikh et al., 2020) and Schema Guided Dialog (SGD) (Rastogi et al., 2019) from the GEM benchmark.
Hardware Specification | Yes | "Each pretraining run is typically trained using 64 to 128 TPUv4 chips (Jouppi et al., 2020)." and "We use a batch size of 1024 and 512 TPUv4 chips for pretraining this model."
Software Dependencies | No | Our experiments are all conducted in JAX/Flax (Bradbury et al., 2018) using the open source T5X framework (Roberts et al., 2022) and Flaxformer.
Experiment Setup | Yes | "We pre-train all models for 500K steps with a batch size of 128 and a sequence length of 512 inputs and 512 targets using the C4 corpus. ... We optimize our model with the Adafactor (Shazeer & Stern, 2018) optimizer with an inverse square root learning rate." and "For supervised finetuning, we generally adopt a learning rate in the range of {5e-5, 1e-5 to 1e-4} using the Adafactor optimizer. The general recipe is that we reset Adafactor optimizer states and/or adopt a loss normalization based on the number of real target tokens. ... Batch size is generally in the range of 32 to 128." (see the schedule sketch below the table)
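The Pseudocode row above only quotes the ul2_objective signature. As a rough illustration of the mixture-of-denoisers idea behind it, here is a minimal, self-contained Python sketch: it samples one denoiser configuration per example and prepends the corresponding mode token ([R], [S], or [X]). This is not the paper's seqio implementation; the denoiser settings only approximate the (mean span length, noise density) values discussed in the paper, and apply_denoiser / ul2_example are hypothetical helper names.

# Minimal mixture-of-denoisers sketch (illustrative only, not the paper's seqio code).
import random

# (mode token, mean noise span length, noise density); values approximate the
# R-, X- and S-denoiser settings described in the paper.
DENOISERS = [
    ("[R]", 3.0, 0.15),   # R-denoiser: short spans, low corruption rate
    ("[X]", 32.0, 0.15),  # X-denoiser: long spans
    ("[X]", 3.0, 0.50),   # X-denoiser: high corruption rate
    ("[S]", None, 0.25),  # S-denoiser: sequential / PrefixLM-style objective
]

def apply_denoiser(tokens, mode, mean_span, density, rng):
    """Corrupt one token sequence and return (inputs, targets) for that denoiser."""
    if mode == "[S]":
        # PrefixLM-style: condition on a prefix, predict the remaining suffix.
        split = max(1, int(len(tokens) * (1.0 - density)))
        return [mode] + tokens[:split], tokens[split:]
    # Span corruption: mask roughly `density` of the tokens in spans of roughly
    # `mean_span` tokens, replacing each span with a sentinel; the targets
    # reconstruct the masked spans.
    n_corrupt = max(1, int(len(tokens) * density))
    n_spans = max(1, round(n_corrupt / mean_span))
    span_len = max(1, n_corrupt // n_spans)
    starts = sorted(rng.sample(range(len(tokens)), n_spans))
    inputs, targets, cursor = [mode], [], 0
    for i, start in enumerate(starts):
        start = max(start, cursor)
        end = min(len(tokens), start + span_len)
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        cursor = end
    inputs += tokens[cursor:]
    return inputs, targets

def ul2_example(tokens, rng=None):
    """Sample a denoiser (uniform mixing rates here for simplicity) and apply it."""
    rng = rng or random.Random(7)
    mode, mean_span, density = rng.choice(DENOISERS)
    return apply_denoiser(tokens, mode, mean_span, density, rng)

For example, ul2_example("the quick brown fox jumps over the lazy dog".split()) returns an (inputs, targets) pair whose first input token is the sampled mode token, mirroring how UL2 tags each example with its paradigm.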
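The Experiment Setup row mentions Adafactor with an inverse square root learning rate but quotes no schedule constants. Below is a hedged sketch of the T5-style inverse square root schedule commonly paired with Adafactor; the 10,000-step warmup constant is an assumption, not a value quoted in the paper.

# Illustrative T5-style inverse square root schedule (assumed constants).
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Hold 1/sqrt(warmup_steps) during warmup, then decay as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

# Example values over a 500K-step pretraining run (batch size 128, per the quote).
for step in (1, 10_000, 100_000, 500_000):
    print(step, round(inverse_sqrt_lr(step), 6))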