Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Core Knowledge Deficits in Multi-Modal Language Models
Authors: Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs |
| Researcher Affiliation | Academia | 1University of California San Diego 2Johns Hopkins University 3Emory University 4University of North Carolina at Chapel Hill 5Stanford University 6Ben-Gurion University of the Negev 7University of Michigan 8University College London 9Carnegie Mellon University. Correspondence to: Yijiang Li <EMAIL>, Dezhi Luo <EMAIL>, Hokin Deng <EMAIL>. |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page at https://williamium3000.github.io/core-knowledge/. This is a project page, not an explicit statement of code release or a direct link to a code repository for the methodology. |
| Open Datasets | Yes | We introduce Core Cognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science... Project page at https://williamium3000.github.io/core-knowledge/. |
| Dataset Splits | No | The paper introduces a benchmark called Core Cognition comprising 1,503 samples, but it does not specify any training, validation, or test dataset splits for reproduction. |
| Hardware Specification | Yes | Inference is performed on clusters equipped with 8 NVIDIA A100 80 GB GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | For each k-choice question, we cyclically rotate the answer options k times, generating k versions with different option orders... We apply a Hybrid Matching mechanism. Specifically, we prioritize a rule-based template matching approach to extract answers from MLLM responses. If template matching method failed, we turn to a model-based ensemble strategy using four advanced LLMs: Qwen2.5-72B-Instruct, Mixtral-8x7B-Instruct-v0.1, DeepSeek-R1-Distill-Llama-70B, and llama3.1-70B. The LLM-based result is accepted only when at least three of the four models produce consistent extractions; otherwise, the matching is deemed unsuccessful. |
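
The experiment setup quoted above (cyclic rotation of answer options, then rule-based extraction with a 3-of-4 LLM consensus fallback) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: every function name is hypothetical, the regex is a toy stand-in for the paper's unspecified templates, and the LLM extractors are represented by plain callables rather than calls to the four models named in the table.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical type: anything that maps a model response to an option letter or None.
Extractor = Callable[[str], Optional[str]]


def cyclic_rotations(options: Sequence[str]) -> List[List[str]]:
    """Return the k cyclic rotations of a k-option list (one per starting offset)."""
    k = len(options)
    return [list(options[i:]) + list(options[:i]) for i in range(k)]


def toy_template_match(response: str) -> Optional[str]:
    """Toy rule-based matcher (assumption): looks for 'answer is (X)' or 'answer: X'."""
    m = re.search(r"[Aa]nswer\s*(?:is|:)\s*\(?([A-D])\)?(?![A-Za-z])", response)
    return m.group(1) if m else None


def consensus_extraction(extractions: Sequence[Optional[str]],
                         min_agree: int = 3) -> Optional[str]:
    """Accept an extracted answer only if at least `min_agree` extractors agree on it."""
    votes = Counter(e for e in extractions if e is not None)
    if not votes:
        return None
    answer, count = votes.most_common(1)[0]
    return answer if count >= min_agree else None


def hybrid_match(response: str,
                 template_match: Extractor,
                 llm_extractors: Sequence[Extractor],
                 min_agree: int = 3) -> Optional[str]:
    """Try the rule-based matcher first; fall back to the ensemble vote if it fails."""
    rule_based = template_match(response)
    if rule_based is not None:
        return rule_based
    votes = [extract(response) for extract in llm_extractors]
    return consensus_extraction(votes, min_agree)  # None means matching was unsuccessful
```

With four extractors and `min_agree=3`, a 2-2 split or any result with fewer than three matching votes returns `None`, which corresponds to the "matching is deemed unsuccessful" case in the table.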