UL2: Unifying Language Learning Paradigms
Authors: Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. |
| Researcher Affiliation | Industry | Google Research, Brain Team {yitay,dehghani}@google.com |
| Pseudocode | Yes | C.4 IMPLEMENTATION DETAILS AND UL2 CODE This section aims to give more insight into how UL2 pretraining is implemented. Our implementation is actually pretty simple. It is simply a mixture of different pretraining objectives that is implemented in seqio. Most of our experiments were run by simply mixing different seqio tasks with seqio's MixtureRegistry. However, one could also implement a generalized UL2 objective with the following function, which could be cleaner. `def ul2_objective(dataset: tf.data.Dataset, sequence_length: seqio.preprocessors.SequenceLengthType, output_features: seqio.preprocessors.OutputFeaturesType, use_prefix_lm_task: bool = False, rates: Optional[Sequence[float]] = None, mean_noise_span_lengths: Sequence[float] = (3.0,), noise_densities: Sequence[float] = (0.15,), shard_ds: bool = True, optional_task_prefixes: Optional[Sequence[str]] = None, input_feature_key: str = "inputs", merge_examples_to_reduce_padding: bool = True, reserved_for_packing: bool = None, seed: int = 7) -> tf.data.Dataset:` (A standalone Python sketch of this mixture-of-denoisers setup is given after the table.) |
| Open Source Code | Yes | We publicly release Flax-based T5X model checkpoints for the 20B model. Pretrained checkpoints will be released at https://github.com/anonymous. |
| Open Datasets | Yes | The datasets we use are SuperGLUE (Wang et al., 2019), comprising 8 NLU sub-tasks. We also conduct experiments on 3 datasets from the GEM benchmark (Gehrmann et al., 2021) that focuses on language generation problems. We arbitrarily select XSUM (summarization), ToTTo (table-to-text generation) (Parikh et al., 2020) and Schema Guided Dialog (SGD) (Rastogi et al., 2019) from the GEM benchmark. |
| Dataset Splits | Yes | The datasets we use are SuperGLUE (Wang et al., 2019), comprising 8 NLU sub-tasks. We also conduct experiments on 3 datasets from the GEM benchmark (Gehrmann et al., 2021) that focuses on language generation problems. We arbitrarily select XSUM (summarization), ToTTo (table-to-text generation) (Parikh et al., 2020) and Schema Guided Dialog (SGD) (Rastogi et al., 2019) from the GEM benchmark. |
| Hardware Specification | Yes | Each pretraining run is typically trained using 64 to 128 TPUv4 chips (Jouppi et al., 2020). ... We use a batch size of 1024 and 512 TPUv4 chips for pretraining this model. |
| Software Dependencies | No | Our experiments are all conducted in JAX/Flax (Bradbury et al., 2018) using the open source T5X framework (Roberts et al., 2022) and Flaxformer. |
| Experiment Setup | Yes | We pre-train all models for 500K steps with a batch size of 128 and a sequence length of 512 inputs and 512 targets using the C4 corpus. ... We optimize our model with the Adafactor (Shazeer & Stern, 2018) optimizer with an inverse square root learning rate. ... For supervised finetuning, we generally adopt a learning rate in the range of {5e-5, 1e-5 to 1e-4} using the Adafactor optimizer. The general recipe is that we reset the Adafactor optimizer states and/or adopt a loss normalization based on the number of real target tokens. ... Batch size is generally in the range of 32 to 128. (A sketch of the inverse square root schedule follows the table.) |
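
The Pseudocode row only quotes the signature of the paper's `ul2_objective` preprocessor. As a rough, standalone illustration of the mixture-of-denoisers idea it describes, the Python sketch below samples one of three denoiser configurations (R/S/X-style) and builds sentinel-masked inputs and targets. The mixing rates, the X-denoiser constants, and all helper names here are illustrative assumptions, not the authors' implementation, which mixes seqio tasks instead.

```python
# Minimal sketch of a UL2-style mixture of denoisers. DENOISERS, the constants,
# and the helper names are illustrative assumptions; the released implementation
# mixes seqio tasks via seqio's MixtureRegistry.
import random
from typing import List, Optional, Tuple

# (paradigm_token, mean_noise_span_length, noise_density, mixing_rate)
DENOISERS: List[Tuple[str, Optional[float], float, float]] = [
    ("[R]", 3.0, 0.15, 0.5),    # regular span corruption: short spans, low density
    ("[X]", 32.0, 0.5, 0.25),   # extreme denoising: long spans / high density
    ("[S]", None, 0.25, 0.25),  # sequential (prefix-LM style): predict a suffix
]

def apply_span_corruption(tokens: List[str], mean_span_len: float,
                          density: float, rng: random.Random
                          ) -> Tuple[List[str], List[str]]:
    """Mask non-overlapping spans with sentinels; return (inputs, targets)."""
    n_noise = max(1, round(len(tokens) * density))
    n_spans = max(1, round(n_noise / mean_span_len))
    seg = max(1, len(tokens) // n_spans)          # one masked span per segment
    span_len = min(max(1, n_noise // n_spans), seg)
    inputs, targets, pos = [], [], 0
    for i in range(n_spans):
        seg_start = i * seg
        seg_end = (i + 1) * seg if i < n_spans - 1 else len(tokens)
        start = rng.randrange(seg_start, max(seg_start + 1, seg_end - span_len + 1))
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[pos:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + span_len])
        pos = start + span_len
    inputs.extend(tokens[pos:])
    return inputs, targets

def ul2_example(tokens: List[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Sample one denoiser by its mixing rate and build a training example."""
    mode, mean_len, density, _ = rng.choices(
        DENOISERS, weights=[d[3] for d in DENOISERS], k=1)[0]
    if mean_len is None:                          # S-denoiser: prefix -> suffix
        split = max(1, int(len(tokens) * (1.0 - density)))
        inputs, targets = tokens[:split], tokens[split:]
    else:                                         # R- / X-denoiser: span corruption
        inputs, targets = apply_span_corruption(tokens, mean_len, density, rng)
    return [mode] + inputs, targets               # prepend the paradigm token

rng = random.Random(7)
toks = "the quick brown fox jumps over the lazy dog near the river bank".split()
print(ul2_example(toks, rng))
```

The paradigm tokens [R], [S], and [X] and the (3.0, 0.15) defaults mirror the quoted `ul2_objective` signature; everything else is a placeholder chosen so the example runs end to end.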
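
The Experiment Setup row mentions Adafactor with an inverse square root learning rate. For reference, a common form of that schedule is sketched below; the 10k warmup constant and the max(step, warmup) form follow the usual T5 recipe and are assumptions, not values quoted in this report.

```python
def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square root schedule: constant at warmup_steps**-0.5, then step**-0.5."""
    return max(step, warmup_steps) ** -0.5

# e.g. inverse_sqrt_lr(10_000) == 0.01; by step 500_000 the rate has decayed to ~0.0014.
```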