PaLI: A Jointly-Scaled Multilingual Language-Image Model
Authors: Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design. |
| Researcher Affiliation | Industry | Google Research |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found. |
| Open Source Code | No | The model is a research prototype, and the current version is not available to the public. |
| Open Datasets | Yes | The model is pre-trained on the following mixture of datasets: WebLI (Table 24), CC3M-35L (Sharma et al., 2018), VQ2A-CC3M-35L (Changpinyo et al., 2022a), Open Images (Kuznetsova et al., 2020), Visual Genome (Krishna et al., 2017) and Objects365 (Shao et al., 2019). |
| Dataset Splits | Yes | We perform near de-duplication of the images against the train, validation, and test splits of 68 common vision/vision-language datasets. |
| Hardware Specification | Yes | The largest model, PaLI-17B, is pretrained using 1,024 GCP-TPUv4 chips for 7 days. |
| Software Dependencies | No | The overall PaLI models are implemented in JAX/Flax (Bradbury et al., 2018) using the open-source T5X (Roberts et al., 2022) and Flaxformer (Heek et al., 2020) frameworks. |
| Experiment Setup | Yes | All PaLI variants are trained for one epoch over the entire pre-training dataset (1.6B examples) with 224×224 image resolution. For the largest model, PaLI-17B, we perform an additional high-res (588×588) phase similar to previous works (Radford et al., 2021; Yuan et al., 2021; Yu et al., 2022). For the learning rate, we use a 1k-step linear warmup, followed by inverse square-root decay. For PaLI-3B, we use a peak learning rate of 1e-2. For larger models, PaLI-15B and PaLI-17B, we use a peak learning rate of 5e-3. We use the Adafactor (Shazeer & Stern, 2018) optimizer with β1 = 0 and second-moment exponential decay set to 0.8. |
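The near de-duplication quoted in the Dataset Splits row is described but not shown in the paper. Below is a minimal, hypothetical sketch of one standard way such filtering is done: embed each image, then discard any pre-training image whose embedding lies too close to an image from an evaluation split. The embedding source, the cosine-similarity threshold of 0.95, and the function name `near_dedup_mask` are illustrative assumptions, not the authors' actual procedure.

```python
# Hypothetical near-deduplication sketch (NOT the authors' pipeline):
# drop pre-training images whose embedding is near any eval-split image.
import numpy as np

def near_dedup_mask(train_emb: np.ndarray, eval_emb: np.ndarray,
                    threshold: float = 0.95) -> np.ndarray:
    """Return a boolean mask of pre-training images that are safe to keep.

    train_emb: (n_train, d) image embeddings of the pre-training corpus.
    eval_emb:  (n_eval, d) image embeddings of the train/val/test eval splits.
    threshold: cosine-similarity cutoff (0.95 is an illustrative guess).
    """
    # L2-normalize so the dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    evals = eval_emb / np.linalg.norm(eval_emb, axis=1, keepdims=True)
    sims = train @ evals.T                # (n_train, n_eval) similarities
    return sims.max(axis=1) < threshold   # keep images far from every eval image
```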
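The Experiment Setup row pins down the optimizer and learning-rate schedule precisely enough to sketch in code. Since the paper reports that PaLI is implemented in JAX/Flax, the sketch below uses Optax; anchoring the inverse square-root decay at the warmup step count is an assumption borrowed from the common T5X convention, and `warmup_rsqrt_schedule` is a hypothetical name.

```python
# Minimal Optax sketch of the reported PaLI pre-training optimizer setup,
# assuming rsqrt decay anchored at the warmup step count (T5X convention).
import jax.numpy as jnp
import optax

WARMUP_STEPS = 1_000   # 1k-step linear warmup (from the paper)
PEAK_LR = 5e-3         # PaLI-15B/17B peak; the paper uses 1e-2 for PaLI-3B

def warmup_rsqrt_schedule(step):
    """Linear warmup to PEAK_LR, then inverse square-root decay."""
    warmup = PEAK_LR * step / WARMUP_STEPS
    decay = PEAK_LR * jnp.sqrt(WARMUP_STEPS / jnp.maximum(step, WARMUP_STEPS))
    return jnp.where(step < WARMUP_STEPS, warmup, decay)

optimizer = optax.adafactor(
    learning_rate=warmup_rsqrt_schedule,  # Optax accepts a schedule here
    decay_rate=0.8,                       # second-moment exponential decay
    momentum=None,                        # beta1 = 0: no first-moment state
)
```

Passing `momentum=None` corresponds to β1 = 0 (no first-moment accumulator), and `decay_rate=0.8` is the second-moment exponential decay quoted above; both are part of what keeps Adafactor memory-efficient at this scale.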