PaLI: A Jointly-Scaled Multilingual Language-Image Model

Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
Researcher Affiliation | Industry | Google Research
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found.
Open Source Code | No | The model is a research prototype and the current version is not available to the public.
Open Datasets | Yes | The model is pre-trained on the following mixture of datasets: WebLI (Table 24), CC3M-35L (Sharma et al., 2018), VQ2A-CC3M-35L (Changpinyo et al., 2022a), Open Images (Kuznetsova et al., 2020), Visual Genome (Krishna et al., 2017), and Object365 (Shao et al., 2019).
Dataset Splits | Yes | We perform near de-duplication of the images against the train, validation, and test splits of 68 common vision/vision-language datasets.
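The paper does not spell out its de-duplication method in this quote, but the idea is standard. The sketch below is a minimal illustration, assuming a perceptual-hash approach via the `imagehash` library; the helper names and the Hamming-distance threshold are hypothetical, not the paper's actual pipeline.

```python
# Illustrative near-deduplication sketch (NOT the paper's method):
# fingerprint every eval-split image once, then drop any pre-training
# image whose perceptual hash lands within a small Hamming distance.
from PIL import Image
import imagehash

def build_eval_index(eval_image_paths, hash_size=16):
    """Fingerprint every image in the eval (train/val/test) splits."""
    return {imagehash.phash(Image.open(p), hash_size=hash_size)
            for p in eval_image_paths}

def is_near_duplicate(candidate_path, eval_index, max_bits=4, hash_size=16):
    """True if the candidate is within `max_bits` Hamming distance of
    any eval-split hash, i.e. a near (not just exact) duplicate."""
    h = imagehash.phash(Image.open(candidate_path), hash_size=hash_size)
    return any(h - e <= max_bits for e in eval_index)
```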
Hardware Specification | Yes | The largest model, PaLI-17B, is pre-trained using 1,024 GCP TPUv4 chips for 7 days.
Software Dependencies | No | The overall PaLI models are implemented in JAX/Flax (Bradbury et al., 2018) using the open-source T5X (Roberts et al., 2022) and Flaxformer (Heek et al., 2020) frameworks.
Experiment Setup | Yes | All PaLI variants are trained for one epoch over the entire pre-training dataset (1.6B examples) with 224×224 image resolution. For the largest model, PaLI-17B, we perform an additional high-res (588×588) phase similar to previous works (Radford et al., 2021; Yuan et al., 2021; Yu et al., 2022). For the learning rate, we use a 1k-step linear warmup, followed by inverse square-root decay. For PaLI-3B, we use a peak learning rate of 1e-2. For larger models, PaLI-15B and PaLI-17B, we use a peak learning rate of 5e-3. We use the Adafactor (Shazeer & Stern, 2018) optimizer with β1 = 0 and second-moment exponential decay set to 0.8.
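The quoted optimizer setup maps cleanly onto standard JAX tooling. The following is a minimal sketch, assuming Optax rather than the paper's actual T5X/Flaxformer configuration; `pali_schedule` is a hypothetical helper name, but the constants (1k warmup steps, peak LR 5e-3, β1 = 0, second-moment decay 0.8) come from the quote above.

```python
# Sketch of the quoted pre-training optimizer in Optax (assumption:
# the paper uses T5X/Flaxformer internally, not this exact code).
import jax.numpy as jnp
import optax

def pali_schedule(peak_lr: float = 5e-3, warmup_steps: int = 1_000):
    """1k-step linear warmup to `peak_lr`, then inverse square-root decay."""
    def schedule(step):
        step = jnp.maximum(step, 1)  # avoid division by zero at step 0
        warmup = peak_lr * step / warmup_steps
        decay = peak_lr * jnp.sqrt(warmup_steps / step)
        return jnp.minimum(warmup, decay)
    return schedule

# Peak LR 5e-3 for PaLI-15B/17B; PaLI-3B would use peak_lr=1e-2 instead.
optimizer = optax.adafactor(
    learning_rate=pali_schedule(peak_lr=5e-3),
    decay_rate=0.8,   # second-moment exponential decay from the quote
    momentum=None,    # beta1 = 0, i.e. no first-moment accumulation
)
```

Note that Optax expresses β1 = 0 as `momentum=None` (Adafactor without a first-moment accumulator), which also saves optimizer memory at this model scale.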