Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Donghyeon Lee, Minbyul Jeong, Jaewoo Kang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To overcome this, we introduce CHROKNOWBENCH, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, temporal state. Our evaluation led to the following observations: (1) The ability of eliciting temporal knowledge varies depending on the data format that model was trained on. |
| Researcher Affiliation | Collaboration | Yein Park¹, Chanwoong Yoon¹, Jungwoo Park¹,³, Donghyeon Lee¹,³, Minbyul Jeong², Jaewoo Kang¹,³; ¹Korea University, ²Upstage AI, ³AIGEN Sciences |
| Pseudocode | Yes | Algorithm 1: Iterative Distractor Generation Algorithm Algorithm 2: Chronological Prompting Algorithm |
| Open Source Code | Yes | Our datasets and code are publicly available at https://github.com/dmis-lab/ChroKnowledge |
| Open Datasets | Yes | Our datasets and code are publicly available at https://github.com/dmis-lab/ChroKnowledge |
| Dataset Splits | Yes | The test set consists of 10% of the total dataset from each domain. |
| Hardware Specification | Yes | The precision is done with eight NVIDIA A100 GPUs (80GB). |
| Software Dependencies | No | We utilize the rapidfuzz library to compare the model's responses with the predefined labels. ... We utilize the spaCy en_core_web_lg model to detect named entities in the paragraphs... |
| Experiment Setup | Yes | We use a temperature set T ∈ {0, 0.7} to capture variations in prediction, where T includes both greedy decoding and temperature sampling. We set n as 5, meaning that we evaluate using five distinct combinations of few-shot exemplars to ensure the robust assessment. |
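
The Dataset Splits row states only that the test set is 10% of the data from each domain; it does not say how those records were selected. As a rough illustration of what a per-domain hold-out could look like, here is a minimal sketch. The `domain` field name, the random shuffle, the fixed seed, and the toy records are all assumptions made for this example and are not taken from the paper or its repository.

```python
import random
from collections import defaultdict


def split_by_domain(records, test_fraction=0.10, seed=0):
    """Hold out `test_fraction` of the records from each domain as a test set.

    Assumption: each record is a dict with a "domain" key (hypothetical schema).
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable

    # Group records by their domain label.
    by_domain = defaultdict(list)
    for record in records:
        by_domain[record["domain"]].append(record)

    train, test = [], []
    for domain in sorted(by_domain):
        items = list(by_domain[domain])
        rng.shuffle(items)
        # Take the same fraction from every domain, not from the pooled data.
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy records; domain names and sizes are placeholders.
records = [
    {"domain": name, "id": i}
    for name in ("domain_a", "domain_b")
    for i in range(50)
]
train, test = split_by_domain(records)
```

Splitting within each domain, rather than sampling 10% of the pooled data, keeps every domain represented in the test set in proportion to its size.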