Visually-Augmented Language Modeling
Authors: Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VALM on various visual knowledge-intensive commonsense reasoning tasks, which require visual information to excel. The experimental results illustrate that VALM outperforms all strong language-only and vision-language baselines with substantial gains in reasoning object commonsense including color, size, and shape. Our code is available at https://github.com/Victorwz/VaLM. |
| Researcher Affiliation | Collaboration | University of California, Santa Barbara; Microsoft Research; weizhiwang@ucsb.edu, {lidong1, haocheng}@microsoft.com |
| Pseudocode | No | The paper includes a system overview diagram (Figure 1) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Victorwz/VaLM. |
| Open Datasets | Yes | We use the English corpus of CC-100 (Conneau et al., 2020) as the pre-training text corpus for both VALM and baseline GPT-2. The CC-100 corpus is one of the largest high-quality web-crawled text datasets. |
| Dataset Splits | No | The paper states 'We use the English corpus of CC-100 (Conneau et al., 2020) as the pre-training text corpus' and 'Due to the limitation of computing resources, we only consume 15% of CC-100 English monolingual corpus for pre-training VALM and baseline GPT-2', and that evaluation is done 'in a zero-shot manner without any task-specific tuning'. It does not explicitly define specific train/validation/test splits for the data used in their experiments. |
| Hardware Specification | Yes | The proposed VALM and re-implemented GPT-2 are trained for 500k steps using 16 Nvidia Tesla V100-SXM2-32GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'fairseq toolkit', 'Adam optimizer', and 'faiss toolkit', but it does not specify version numbers for these software dependencies (e.g., 'fairseq vX.Y.Z'). |
| Experiment Setup | Yes | Hyperparameter settings and training details are presented in Appendix B.1. The proposed model deploys a transformer decoder architecture with 124M trainable parameters, in which n_layer = 12, n_head = 12, d_embed = 768. We deploy the Adam (Kingma & Ba, 2015) optimizer (β_1 = 0.9, β_2 = 0.98) and train all models with lr = 0.0005, t_warmup = 4000, dropout = 0.1, bsz = 128, len = 512. The layer normalization over the retrieved image keys is initialized with ε = 0.00001. |
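
For readers who want to mirror the reported setup, the sketch below collects the hyperparameters from the Experiment Setup row into a small, runnable PyTorch configuration. It is only an illustration under stated assumptions: the dataclass field names, the inverse-sqrt warmup schedule, and the stand-in decoder layer are ours, not the authors' actual fairseq configuration, which is available in the linked repository.

```python
from dataclasses import dataclass

import torch
from torch import nn, optim


@dataclass
class VALMTrainConfig:
    """Hyperparameters as reported in the paper's Appendix B.1 (field names are illustrative)."""
    n_layer: int = 12          # transformer decoder layers
    n_head: int = 12           # attention heads
    d_embed: int = 768         # hidden size (~124M trainable parameters overall)
    lr: float = 5e-4
    adam_betas: tuple = (0.9, 0.98)
    warmup_steps: int = 4_000
    total_steps: int = 500_000
    dropout: float = 0.1
    batch_size: int = 128
    seq_len: int = 512
    ln_eps: float = 1e-5       # layer norm over the retrieved image keys


def build_optimizer_and_scheduler(model: nn.Module, cfg: VALMTrainConfig):
    """Adam plus an inverse-sqrt warmup schedule (the schedule choice is an assumption;
    the paper only reports lr and t_warmup)."""
    optimizer = optim.Adam(model.parameters(), lr=cfg.lr, betas=cfg.adam_betas)

    def inverse_sqrt(step: int) -> float:
        step = max(step, 1)
        if step < cfg.warmup_steps:
            return step / cfg.warmup_steps          # linear warmup to the peak lr
        return (cfg.warmup_steps / step) ** 0.5     # inverse-sqrt decay afterwards

    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
    return optimizer, scheduler


if __name__ == "__main__":
    cfg = VALMTrainConfig()
    # Stand-in module; the actual VALM decoder (with visual-knowledge fusion) is defined
    # in the authors' repository at https://github.com/Victorwz/VaLM.
    toy_model = nn.TransformerDecoderLayer(
        d_model=cfg.d_embed, nhead=cfg.n_head, dropout=cfg.dropout,
        layer_norm_eps=cfg.ln_eps, batch_first=True,
    )
    opt, sched = build_optimizer_and_scheduler(toy_model, cfg)
    print(cfg)
```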