MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge

Authors: Yuxuan Zhou, Xien Liu, Chen Ning, Ji Wu

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results on these multifaceted datasets demonstrate that the extent of current LLMs in mastering medical knowledge is far below their performance on existing medical benchmarks, suggesting that they lack depth, precision, and comprehensiveness in mastering medical knowledge.
Researcher Affiliation | Academia | Yuxuan Zhou, Xien Liu, Chen Ning and Ji Wu, Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China. {zhou-yx21, nc22}@mails.tsinghua.edu.cn, {xeliu, wuji_ee}@mail.tsinghua.edu.cn
Pseudocode | No | The paper describes methods and processes in narrative text and uses diagrams, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The codes and datasets are available at https://github.com/THUMLP/MultifacetEval.
Open Datasets | Yes | Based on the MultifacetEval framework, we construct two multifaceted evaluation datasets: MultiDiseK (by producing questions from a clinical disease knowledge base) and MultiMedQA (by rephrasing each question from the medical benchmark MedQA into multifaceted questions). The codes and datasets are available at https://github.com/THUMLP/MultifacetEval.
Dataset Splits | No | The paper describes constructing two multifaceted evaluation datasets, MultiDiseK and MultiMedQA, which are used to evaluate pre-trained LLMs. It specifies the total number of questions in these datasets (e.g., '6,334 MCQs, 12,668 RQs, 6,334 MAQs, and 6,334 TFQs' for MultiDiseK), but does not provide traditional train/validation/test splits, as the datasets are intended for evaluation rather than for training models within the scope of the paper.
Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions various LLMs and evaluation settings (e.g., 'Chain-of-Thought with Self-consistency'), but does not provide specific software dependencies or library versions (such as Python, PyTorch, or CUDA versions) required to replicate the experimental setup.
Experiment Setup | Yes | We evaluate LLMs by five-shot learning on the proposed datasets. We report the performance of LLMs under two settings: (1) answer-only [Brown et al., 2020]: prompting LLMs with only question-answer pairs; (2) Chain-of-Thought with Self-consistency (CoT+SC) [Wang et al., 2022]: prompting LLMs multiple times with question-answer pairs and the chain-of-thoughts, aggregating the results by majority vote to obtain the final answer. For the latter setting, we generate CoTs following the method proposed in [Nori et al., 2023b] and ask LLMs each question 5 times in our implementation.
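
The CoT+SC setting in the Experiment Setup row can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' released implementation: query_llm is a hypothetical stand-in for whatever LLM client is used, extract_answer assumes each chain-of-thought completion ends with a line like "Answer: B", and the prompt construction is an assumption; the actual prompting and answer parsing in the linked repository may differ.

from collections import Counter

# Hypothetical stand-in for an LLM API call; replace with a real client.
def query_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("plug in an LLM client here")

def extract_answer(completion: str) -> str:
    # Assumption: the chain-of-thought completion ends with "Answer: <label>".
    tail = completion.rsplit("Answer:", 1)[-1].strip()
    return tail.split()[0] if tail else ""

def cot_self_consistency(question: str, few_shot_block: str, n_samples: int = 5) -> str:
    # Ask the same five-shot CoT prompt several times and majority-vote the
    # extracted answers, mirroring the CoT+SC setting (5 samples per question).
    prompt = f"{few_shot_block}\n\nQuestion: {question}\nLet's think step by step."
    votes = [extract_answer(query_llm(prompt)) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]

Sampling with a nonzero temperature is what makes the repeated completions diverse enough for the majority vote to improve over a single greedy decode, which is the point of the self-consistency aggregation.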