UNIFIED-IO: A Unified Model for Vision, Language, and Multi-modal Tasks
Authors: Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now present results for UNIFIED-IO on the GRIT benchmark (Sec 4.1), ablate training data via the GRIT ablation benchmark (Sec 4.2) and evaluate UNIFIED-IO on 16 other benchmarks in computer vision and NLP (Sec 4.3). |
| Researcher Affiliation | Collaboration | Allen Institute for AI; University of Washington, Seattle |
| Pseudocode | No | Not found. The paper includes architectural diagrams and descriptions but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and demos for UNIFIED-IO are available at: unified-io.allenai.org |
| Open Datasets | Yes | To fully test this capability, we gather 95 vision, language, and multi-modal datasets from 62 publicly available data sources as targets for our model to learn during multi-task training. |
| Dataset Splits | No | Not found. The paper mentions multi-task training on a large dataset and evaluation on various benchmarks, but it does not explicitly provide details about the specific validation dataset splits used during its own training process, beyond relying on the test splits of established benchmarks. |
| Hardware Specification | No | Not found. The paper mentions training on various model sizes with different batch sizes and parallelization strategies but does not specify details like GPU or CPU models used for training. |
| Software Dependencies | No | Not found. The paper mentions using VQ-GAN and Adafactor optimizer and references their respective papers, but it does not specify software versions for these or other dependencies like Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | We use a learning rate of 10^-2 for the first 10,000 steps and then decay at a rate of 1/√k. We train with β1 = 0.9 and β2 = 1.0 − k^(−0.8), where k is the step number. We use global norm gradient clipping with 1.0 and find this is crucial to stabilize XL training. We train the Small, Base and Large models with a batch size of 2048 and XL with a batch size of 1024 due to memory considerations. ... For all models, we train 1000k steps: 500k each for pre-training and multi-task training. |
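
The quoted schedule is the T5/Adafactor-style recipe: a constant peak learning rate of 10^-2 for 10k warmup steps, then inverse-square-root decay (note that 1/√10000 = 10^-2, so the two phases join continuously), with a step-dependent second-moment decay for Adafactor. The sketch below is a minimal illustration of that reading of the setup, not the authors' code; the function names and the `config` dictionary are hypothetical.

```python
def learning_rate(step: int, warmup_steps: int = 10_000) -> float:
    """Constant 1e-2 for the first `warmup_steps`, then 1/sqrt(k) decay.

    Because 1/sqrt(10_000) == 1e-2, the warmup plateau and the decay
    phase meet without a discontinuity.
    """
    return 1.0 / (max(step, warmup_steps) ** 0.5)


def adafactor_beta2(step: int) -> float:
    """Step-dependent second-moment decay: beta2 = 1.0 - k^(-0.8)."""
    return 1.0 - max(step, 1) ** -0.8


# Other settings quoted from the paper (illustrative grouping only).
config = {
    "beta1": 0.9,
    "global_norm_clip": 1.0,          # reported as crucial for stable XL training
    "batch_size": {"Small": 2048, "Base": 2048, "Large": 2048, "XL": 1024},
    "total_steps": 1_000_000,         # 500k pre-training + 500k multi-task training
}

if __name__ == "__main__":
    for k in (1_000, 10_000, 100_000, 1_000_000):
        print(f"step {k:>9}: lr={learning_rate(k):.5f}, beta2={adafactor_beta2(k):.5f}")
```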