Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Single Transformer for Scalable Vision-Language Modeling

Authors: Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On extensive evaluation, SOLO demonstrates performance comparable to LLaVA-v1.5-7B, particularly excelling in visual mathematical reasoning. ... We present the main experimental results of SOLO in Tab. 2. ... We select a wide range of benchmarks, encompassing both general vision-language tasks and specific task-oriented datasets, for evaluation and analysis. ... Further scalability analysis reveals SOLO's better scaling behaviors, inference speed advantages, easier scaling laws analysis, and the scalability and benefits of our flexible image preprocessing pipeline (6.2). In addition, through comprehensive ablation studies, we validate the design choices of our training recipe.
Researcher Affiliation | Academia | Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji (University of Illinois Urbana-Champaign)
Pseudocode | Yes | Figure 2: The input image resize algorithm to maintain the aspect ratio.

    def get_resize_output_image_size(image_size):
        l1, l2 = image_size
        if l2 <= l1:
            short, long = l2, l1
        else:
            short, long = l1, l2
        requested_new_long = min(int(long / PATCH_SIZE + 1) * PATCH_SIZE, MAX_RESOLUTION)
        new_long = requested_new_long
        new_short = int(new_long * short / long)
        new_short = int(new_short / PATCH_SIZE + 1) * PATCH_SIZE
        if l2 <= l1:
            return new_long, new_short
        else:
            return new_short, new_long
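The Figure 2 pseudocode becomes runnable once the two constants it references are supplied. The sketch below uses assumed values (patch size 32, maximum long side 1024) purely for illustration; the paper's actual settings may differ. It rounds the long side up to a patch-size multiple (capped at the maximum resolution), scales the short side to preserve the aspect ratio, and rounds it up to a patch-size multiple as well.

```python
# Self-contained sketch of the Figure 2 resize rule.
# PATCH_SIZE and MAX_RESOLUTION are assumed values, not the paper's settings.
PATCH_SIZE = 32
MAX_RESOLUTION = 1024

def get_resize_output_image_size(image_size):
    """Return (width, height) with both sides rounded up to patch multiples,
    the long side capped at MAX_RESOLUTION, and aspect ratio preserved."""
    l1, l2 = image_size
    short, long = (l2, l1) if l2 <= l1 else (l1, l2)
    # Round the long side up to the next patch multiple, capped.
    new_long = min(int(long / PATCH_SIZE + 1) * PATCH_SIZE, MAX_RESOLUTION)
    # Scale the short side proportionally, then round it up likewise.
    new_short = int(new_long * short / long)
    new_short = int(new_short / PATCH_SIZE + 1) * PATCH_SIZE
    return (new_long, new_short) if l2 <= l1 else (new_short, new_long)

print(get_resize_output_image_size((100, 250)))  # → (128, 256)
```

With these assumed constants, a 100 x 250 input maps to 128 x 256: both sides are multiples of 32 and the aspect ratio is approximately maintained.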
Open Source Code | Yes | The code is made public at https://github.com/Yangyi-Chen/SOLO.
Open Datasets | Yes | Stage-1 ImageNet21K (Ridnik et al., 2021b)... Stage-2 Pre-Training on Web-Scale Data... sources like Capfusion (Yu et al., 2024) and CC3M (Sharma et al., 2018b). Additionally, we include synthetically generated web pages with associated HTML code from Websight (Laurençon et al., 2024) to improve OCR performance, and we also include a small set of supervised datasets to improve the data diversity. ... Table 1: Summary of datasets used in the three stages of pre-training.
Dataset Splits | Yes | We select a wide range of benchmarks... for evaluation and analysis. For general vision-language capability evaluation, we choose MMStar (Chen et al., 2024b), MME (Fu et al., 2024), and SEED-Bench (Li et al., 2024a). For scientific document understanding, we choose AI2D (Kembhavi et al., 2016) and Science QA (Lu et al., 2022a). For visual mathematical reasoning, we choose MathVista (Lu et al., 2023).
Hardware Specification | Yes | In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM with the single Transformer architecture using moderate academic resources (8 x A100 80GB GPUs). ... We use one node with 8 NVIDIA A100 80G GPUs for pre-training.
Software Dependencies | No | We modify the standard Megatron-LLM (Cano et al., 2023) to support arbitrary image patch inputs. ... We utilize DeepSpeed (Rasley et al., 2020), as implemented in Accelerate (Gugger et al., 2022), for instruction fine-tuning. The paper mentions specific software tools (Megatron-LLM, DeepSpeed, Accelerate) but does not provide their version numbers.
Experiment Setup | Yes | Training Hyperparameters: We use a global batch size of 128 examples (i.e., 4M tokens) and each pre-training example is packed to 32,768 tokens. We adopt a learning rate of 5e-5 with cosine decay to a minimum learning rate of 5e-6 and warm up for 200 steps. We use weight decay of 0.1. ... Implementation Details: We utilize DeepSpeed (Rasley et al., 2020)... The global batch size is configured at 640, with a weight decay parameter of 0.1. We train for 1 epoch with a maximum learning rate of 1e-5, which follows a linear warm-up phase and transitions to a cosine decay schedule.
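The pre-training schedule quoted above (linear warm-up for 200 steps, then cosine decay from 5e-5 down to 5e-6) can be sketched as a simple step-to-learning-rate function. This is an illustrative reconstruction, not the authors' code; `TOTAL_STEPS` is an assumed value chosen only to make the sketch runnable.

```python
import math

# Sketch of the quoted pre-training LR schedule: linear warm-up for
# 200 steps, then cosine decay from MAX_LR to MIN_LR.
# TOTAL_STEPS is an assumed value for illustration only.
MAX_LR, MIN_LR = 5e-5, 5e-6
WARMUP_STEPS, TOTAL_STEPS = 200, 10_000

def learning_rate(step):
    if step < WARMUP_STEPS:
        # Linear warm-up from 0 to MAX_LR over the first 200 steps.
        return MAX_LR * step / WARMUP_STEPS
    # Cosine decay from MAX_LR at the end of warm-up to MIN_LR at the end.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# The quoted token budget is self-consistent:
# 128 examples/batch * 32,768 tokens/example = 4,194,304 tokens, i.e. ~4M.
assert 128 * 32_768 == 4_194_304
```

At step 200 the function returns exactly 5e-5, and at the final step it bottoms out at 5e-6, matching the quoted maximum and minimum learning rates.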