Visually-Augmented Language Modeling

Authors: Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate VALM on various visual knowledge-intensive commonsense reasoning tasks, which require visual information to excel. The experimental results illustrate that VALM outperforms all strong language-only and vision-language baselines with substantial gains in reasoning object commonsense including color, size, and shape. Our code is available at https://github.com/Victorwz/VaLM.
Researcher Affiliation | Collaboration | University of California, Santa Barbara; Microsoft Research. weizhiwang@ucsb.edu, {lidong1, haocheng}@microsoft.com
Pseudocode | No | The paper includes a system overview diagram (Figure 1) but does not provide any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/Victorwz/VaLM.
Open Datasets | Yes | We use the English corpus of CC-100 (Conneau et al., 2020) as the pre-training text corpus for both VALM and baseline GPT-2. CC-100 corpus is one of the largest high-quality web-crawl text data.
Dataset Splits | No | The paper states 'We use the English corpus of CC-100 (Conneau et al., 2020) as the pre-training text corpus' and 'Due to the limitation of computing resources, we only consume 15% of CC-100 English monolingual corpus for pre-training VALM and baseline GPT-2', and that evaluation is done 'in a zero-shot manner without any task-specific tuning'. It does not explicitly define train/validation/test splits for the data used in its experiments.
Hardware Specification | Yes | The proposed VALM and re-implemented GPT-2 are trained for 500k steps using 16 Nvidia Tesla V100-SXM2-32GB GPUs.
Software Dependencies | No | The paper mentions software such as the 'fairseq toolkit', 'Adam optimizer', and 'faiss toolkit', but it does not specify version numbers for these dependencies (e.g., 'fairseq vX.Y.Z').
Experiment Setup | Yes | Hyperparameter setting and training details are presented in Appendix B.1. The proposed model deploys transformer decoder architecture with 124M trainable parameters, in which n_layer = 12, n_head = 12, d_embed = 768. We deploy Adam (Kingma & Ba, 2015) (β1 = 0.9, β2 = 0.98) optimizer and train all models with lr = 0.0005, t_warmup = 4000, dropout = 0.1, bsz = 128, len = 512. The layer normalization over the retrieved image keys is initialized with ϵ of 0.00001.
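
For readers who want to cross-check these settings, the sketch below collects the hyperparameters quoted in the Experiment Setup row into a single configuration object. It is not taken from the authors' repository; the class and field names are illustrative assumptions, and only the numeric values come from the paper (the 500k-step budget is from the Hardware Specification row).

```python
# Minimal sketch (not from the paper's code): field names are assumptions,
# numeric values are the hyperparameters reported in Appendix B.1.
from dataclasses import dataclass


@dataclass
class VALMTrainingConfig:
    # Transformer decoder architecture (~124M trainable parameters)
    n_layer: int = 12            # decoder layers
    n_head: int = 12             # attention heads per layer
    d_embed: int = 768           # embedding / hidden dimension

    # Adam optimizer (Kingma & Ba, 2015) and schedule
    adam_beta1: float = 0.9
    adam_beta2: float = 0.98
    lr: float = 5e-4             # peak learning rate
    warmup_steps: int = 4000
    dropout: float = 0.1

    # Batching
    batch_size: int = 128        # sequences per batch
    seq_len: int = 512           # tokens per sequence

    # LayerNorm over the retrieved image keys
    image_key_ln_eps: float = 1e-5

    # Training budget (from the Hardware Specification row)
    total_steps: int = 500_000


if __name__ == "__main__":
    print(VALMTrainingConfig())
```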