Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models
Authors: Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Balachandran, Tianxing He, Yulia Tsvetkov
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that KNOWLEDGE CARD achieves state-of-the-art performance on six benchmark datasets. Ultimately, KNOWLEDGE CARD framework enables dynamic synthesis and updates of knowledge from diverse domains. |
| Researcher Affiliation | Academia | Shangbin Feng¹, Weijia Shi¹, Yuyang Bai², Vidhisha Balachandran³, Tianxing He¹, Yulia Tsvetkov¹; ¹University of Washington, ²Xi'an Jiaotong University, ³Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Bottom-Up Approach ... Algorithm 2: Top-Down Approach |
| Open Source Code | Yes | Resources are available at https://github.com/BunsenFeng/Knowledge_Card. |
| Open Datasets | Yes | For general-purpose QA, we adopt MMLU (Hendrycks et al., 2020)... To evaluate multi-domain knowledge synthesis, we adopt misinformation detection... We leverage the widely adopted LUN misinformation detection dataset (Rashkin et al., 2017)... |
| Dataset Splits | No | The paper mentions '5-shot in-context learning setting' and an official 'demonstration set' for MMLU and MIDTERMQA, and '16-shot in-context learning' for LUN, which are used for few-shot learning. However, it does not specify a distinct 'validation' split with percentages or counts for hyperparameter tuning or model selection. |
| Hardware Specification | Yes | We used a GPU cluster with 16 NVIDIA A40 GPUs, 1988G memory, and 104 CPU cores for the experiments. |
| Software Dependencies | No | The paper lists specific models and tools used (e.g., OPT-1.3B, MPNet, Pegasus, Codex, FactKB, VitaminC) along with their citations, but it does not provide specific version numbers for the underlying software libraries or environment (e.g., Python version, PyTorch/TensorFlow version, CUDA version). |
| Experiment Setup | Yes | We present hyperparameter settings in Table 6. ... LEARNING RATE 2e-5, BATCH SIZE 32, MAX EPOCHS 10, OPTIMIZER ADAM, TEMPERATURE 0.1 |
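
For readers who want to carry the quoted settings into their own scripts, the values in the Experiment Setup row can be recorded as a small configuration object. This is only a sketch: the class name `ExperimentSetup` and its field names are hypothetical, and the row does not say which component of the pipeline each value applies to, so none is assumed here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row (paper's Table 6).

    The class and field names are hypothetical; only the values come
    from the row above.
    """

    learning_rate: float = 2e-5
    batch_size: int = 32
    max_epochs: int = 10
    optimizer: str = "adam"
    temperature: float = 0.1


# Build the configuration and convert it to a plain dict for logging.
setup = ExperimentSetup()
config = asdict(setup)
```

Freezing the dataclass keeps the recorded values from being changed by accident during a run, which helps when the goal is to document exactly what was used.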