Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling Up Models and Data with t5x and seqio

Authors: Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Kehang Han, Michelle Casbon, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, Andrea Gesmundo

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | These open-source libraries have been used to train models with hundreds of billions of parameters on multi-terabyte datasets. Configurations and instructions for T5-like and GPT-like models are also provided.

Researcher Affiliation | Industry | Lead authors: Adam Roberts (EMAIL), Hyung Won Chung (EMAIL), Gaurav Mishra (EMAIL), Anselm Levskaya (EMAIL), James Bradbury (EMAIL)

Pseudocode | No | The paper describes the architecture and functionality of t5x and seqio but does not include any structured pseudocode or algorithm blocks.

Open Source Code | Yes | The libraries can be found at https://github.com/google-research/t5x and https://github.com/google/seqio.

Open Datasets | No | The paper mentions that t5x and seqio have been used with "multi-terabyte datasets" and discusses "managing data pipelines" but does not provide concrete access information (links, DOIs, citations) for any specific open datasets used in this paper's methodologies or experiments.

Dataset Splits | No | The paper discusses data pipelines and training on large datasets but does not provide specific details on how datasets were split for training, validation, or testing.

Hardware Specification | Yes | Major differentiators of t5x are its use of JAX and Flax for model expression and its support for TPU (including TPU v4)... GPU Support: We provide examples and instructions to run t5x on GPUs in single-node and multi-node configurations, with optimizations for better throughput. More examples can be found in the NVIDIA Rosetta repository, which includes H100 FP8 support and performance improvements.

Software Dependencies | No | t5x leverages JAX's (Bradbury et al., 2018; Frostig et al., 2018) user-friendly NumPy-like (Harris et al., 2020) interface and its powerful jax.pjit API... Additionally, training at scale requires large datasets. We also introduce seqio, an open-source library for managing data pipelines and model evaluations. seqio builds on tf.data, adds support for SPMD-based data parallelism, and is compatible with popular modeling frameworks including JAX, TensorFlow (Abadi et al., 2015), and PyTorch (Paszke et al., 2019)... For model implementation, t5x leverages specialized features in the Flax (Heek et al., 2020) library... For configuration, we use Gin for dependency injection. The paper lists several software libraries and frameworks (JAX, NumPy, TensorFlow, PyTorch, Flax, Gin) but does not provide specific version numbers for these dependencies.

Experiment Setup | No | The paper describes the design and capabilities of the t5x and seqio libraries, including various parallelism strategies and model implementations. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, number of epochs) or detailed training configurations for any specific experiment.
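The "SPMD-based data parallelism" quoted above means every data-parallel host runs the same program but consumes only its own disjoint shard of the global dataset. A minimal sketch of that sharding idea in plain Python (illustrative only; `shard_indices` is a hypothetical helper for exposition, not part of the seqio API):

```python
def shard_indices(num_examples, shard_id, num_shards):
    """Return the example indices one host reads under even, strided sharding.

    Illustrates the SPMD data-parallelism idea: each of the num_shards hosts
    applies the identical program to a disjoint slice of the global data.
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return list(range(shard_id, num_examples, num_shards))

# Example: 10 examples split across 4 hosts; shards are disjoint and
# together cover every example exactly once.
assert shard_indices(10, 0, 4) == [0, 4, 8]
assert shard_indices(10, 3, 4) == [3, 7]
```

In the real libraries this slicing happens inside the input pipeline (seqio builds it on tf.data), so no host ever materializes the full multi-terabyte dataset.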