Measuring Vision-Language STEM Skills of Neural Models

Authors: Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan, Ming Zhang, Chenguang Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we show the performance of a wide set of neural models as well as humans on STEM. The results show that state-of-the-art foundation models like CLIP and GPT-3.5-Turbo still underperform general elementary students.
Researcher Affiliation | Academia | Jianhao Shen1,2, Ye Yuan1,2,3, Srbuhi Mirzoyan1,2,3, Ming Zhang1,2,3, Chenguang Wang4. 1School of Computer Science, Peking University; 2National Key Laboratory for Multimedia Information Processing, Peking University; 3Peking University-Anker Embodied AI Lab; 4Washington University in St. Louis. {jhshen,yuanye_pku,mzhang_cs}@pku.edu.cn, srbuhimirzoyan@stu.pku.edu.cn, chenguangwang@wustl.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset and leaderboard are available at https://huggingface.co/datasets/stemdataset/STEM (see the dataset-loading sketch below the table).
Open Datasets | Yes | The dataset and leaderboard are available at https://huggingface.co/datasets/stemdataset/STEM
Dataset Splits | Yes | We split the dataset into a train set, a validation set, and a test set for model development and evaluation. The overall dataset statistics are included in Table 1.
Hardware Specification | Yes | We use NVIDIA GeForce RTX 3090 GPUs for training.
Software Dependencies | No | The paper names models such as CLIP, GPT-3.5-Turbo, GloVe, UnifiedQA, ViLBERT, UNITER, and VirTex, but does not provide version numbers for the software dependencies used in the implementation (e.g., Python, PyTorch, or other libraries) beyond the API versions of certain models.
Experiment Setup | Yes | We use AdamW for optimization and tune hyperparameters as follows: the batch size is chosen from {16, 32, 64, 128} and, after hyperparameter tuning, set to 16 for few-shot learning and 128 for finetuning and multi-task learning. The learning rate is chosen from the range [5e-6, 5e-5] and set to 1e-5 for all training. We set the warm-up ratio to 0.1 and the weight decay to 0.2. We cap training at 100k samples for finetuning, 200k samples for multi-task training, and 10 epochs for few-shot training, all with early stopping on the validation set (see the training-configuration sketch below the table).
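
Since the Open Datasets row points to a Hugging Face Hub entry and the Dataset Splits row describes train/validation/test splits, a minimal loading sketch is shown below. It assumes the standard `datasets` library and standard Hub split names; the split names and example fields are assumptions, not taken from the dataset card.

# Minimal sketch: load the STEM dataset from the Hugging Face Hub.
# Requires the `datasets` library (pip install datasets). The dataset ID comes
# from the URL cited in the paper; split names and field layout are assumptions.
from datasets import load_dataset

stem = load_dataset("stemdataset/STEM")

print(stem)              # lists the available splits and their sizes
print(stem["train"][0])  # shows one training example (field names may differ)

If the split names differ from the usual train/validation/test, `load_dataset` still returns a dictionary-like object whose keys can be inspected directly.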
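
The Experiment Setup row lists concrete hyperparameters. The sketch below restates the reported finetuning setting (AdamW, batch size 128, learning rate 1e-5, warm-up ratio 0.1, weight decay 0.2, up to 100k training samples) as a PyTorch configuration. The placeholder model, the `transformers` warm-up scheduler, and the step arithmetic are illustrative assumptions, not the authors' code.

# Sketch of the reported finetuning hyperparameters, assuming a PyTorch model
# and a Hugging Face linear warm-up schedule. The placeholder model stands in
# for the paper's vision-language models and is not their architecture.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(512, 2)   # placeholder module for a runnable example

batch_size = 128                  # 128 for finetuning/multi-task, 16 for few-shot
learning_rate = 1e-5              # selected from the [5e-6, 5e-5] range
weight_decay = 0.2
warmup_ratio = 0.1
max_train_samples = 100_000       # 100k for finetuning, 200k for multi-task

num_training_steps = max_train_samples // batch_size
num_warmup_steps = int(warmup_ratio * num_training_steps)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=learning_rate, weight_decay=weight_decay
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

Early stopping on the validation set, as mentioned in the paper, would be handled in the training loop and is omitted here.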