Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Visually-Augmented Language Modeling
Authors: Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VALM on various visual knowledge-intensive commonsense reasoning tasks, which require visual information to excel. The experimental results illustrate that VALM outperforms all strong language-only and vision-language baselines with substantial gains in reasoning object commonsense including color, size, and shape. Our code is available at https://github.com/Victorwz/VaLM. |
| Researcher Affiliation | Collaboration | University of California, Santa Barbara; Microsoft Research; EMAIL, EMAIL |
| Pseudocode | No | The paper includes a system overview diagram (Figure 1) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Victorwz/VaLM. |
| Open Datasets | Yes | We use the English corpus of CC-100 (Conneau et al., 2020) as the pre-training text corpus for both VALM and baseline GPT-2. CC-100 corpus is one of the largest high-quality web-crawl text data. |
| Dataset Splits | No | The paper states 'We use the English corpus of CC-100 (Conneau et al., 2020) as the pre-training text corpus' and 'Due to the limitation of computing resources, we only consume 15% of CC-100 English monolingual corpus for pre-training VALM and baseline GPT-2', and that evaluation is done 'in a zero-shot manner without any task-specific tuning'. It does not explicitly define specific train/validation/test splits for the data used in their experiments. |
| Hardware Specification | Yes | The proposed VALM and re-implemented GPT-2 are trained for 500k steps using 16 Nvidia Tesla V100-SXM2-32GB GPUs. |
| Software Dependencies | No | The paper mentions software like 'fairseq toolkit', 'Adam optimizer', and 'faiss toolkit', but it does not specify version numbers for these software dependencies (e.g., 'fairseq vX.Y.Z'). |
| Experiment Setup | Yes | Hyperparameter setting and training details are presented in Appendix B.1. The proposed model deploys transformer decoder architecture with 124M trainable parameters, in which n_layer = 12, n_head = 12, d_embed = 768. We deploy Adam (Kingma & Ba, 2015) (β_1 = 0.9, β_2 = 0.98) optimizer and train all models with lr = 0.0005, t_warmup = 4000, dropout = 0.1, bsz = 128, len = 512. The layer normalization over the retrieved image keys is initialized with ϵ of 0.00001. |
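
For readers who want to carry the reported setup into their own scripts, the values quoted in the Hardware Specification and Experiment Setup rows can be collected in a small configuration object. This is only a sketch: the class name `ValmSetup` and all field names are hypothetical and do not come from the paper or its repository, and `tokens_per_step` assumes that `bsz` counts sequences per batch, which the quoted text does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ValmSetup:
    """Hypothetical container for the values quoted in the table above."""

    n_layer: int = 12                 # transformer decoder layers
    n_head: int = 12                  # attention heads per layer
    d_embed: int = 768                # embedding dimension
    adam_betas: Tuple[float, float] = (0.9, 0.98)
    lr: float = 0.0005
    warmup_steps: int = 4000
    dropout: float = 0.1
    batch_size: int = 128             # "bsz" in the quoted text
    seq_len: int = 512                # "len" in the quoted text
    layer_norm_eps: float = 0.00001   # layer norm over retrieved image keys
    train_steps: int = 500_000
    num_gpus: int = 16                # Nvidia Tesla V100-SXM2-32GB

    @property
    def head_dim(self) -> int:
        # Per-head width; the embedding size must split evenly across heads.
        if self.d_embed % self.n_head != 0:
            raise ValueError("d_embed must be divisible by n_head")
        return self.d_embed // self.n_head

    @property
    def tokens_per_step(self) -> int:
        # ASSUMPTION: bsz is the number of sequences per batch.
        return self.batch_size * self.seq_len


setup = ValmSetup()
print(setup.head_dim)
print(setup.tokens_per_step)
```

Freezing the dataclass keeps the recorded values from being changed by accident when the object is passed around a training script.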