Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming

Authors: Sicheol Sung, Aditi, Dogyu Kim, Yo-Sub Han, Sang-Ki Ko

IJCAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on the CodeContests dataset demonstrate that CCFG-based test cases outperform baseline methods in identifying incorrect algorithms, achieving significant gains in validity and effectiveness. Our approach provides a scalable and reliable grammar-driven framework for enhancing automated competitive programming evaluations. We evaluate the practical usefulness of CCFGs through experimental validation.
Researcher Affiliation	Academia	Sicheol Sung (Yonsei University), Aditi (University of Seoul), Dogyu Kim (Kangwon National University), Yo-Sub Han (Yonsei University), and Sang-Ki Ko (University of Seoul)
Pseudocode	No	The paper provides formal grammar definitions (Example 2 and Example 3) but does not include structured pseudocode or algorithm blocks for a procedural method or process.
Open Source Code	Yes	All implementations and associated codes and datasets used in these experiments are available in our GitHub repository (https://github.com/Aditi1612/Grammar-based-test-case-generation).
Open Datasets	Yes	We use the CodeContests dataset, which consists of various programming problems sourced from different competitive platforms [Li et al., 2022].
Dataset Splits	Yes	After categorizing the grammars, we split them into a training dataset with 1,200 problems and an evaluation dataset with 300 problems.
Hardware Specification	No	The paper describes experiments and model training but does not provide specific details about the hardware used, such as CPU or GPU models, or memory specifications.
Software Dependencies	No	The paper mentions using a fine-tuned CodeT5 model and an Adam optimizer but does not specify versions for any programming languages, libraries, or frameworks used in the implementation.
Experiment Setup	Yes	We use Adam optimizer with learning rate 10^-5 and cross-entropy loss function to train each CodeT5 model. We generate candidate grammars and constraints with repetition penalty 2.5 and length penalty 1.0 from each model.
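
To make the reported setup easier to reuse, the quoted hyperparameters can be collected in one place, as in the minimal sketch below. It is not the authors' code. The class and function names are hypothetical, the learning rate is read as 10^-5 (an assumption about the extracted value), and the two penalty functions assume the conventions common in sequence-generation toolkits: a CTRL-style repetition penalty on logits, and a beam score divided by length raised to the length penalty. The paper does not state which implementation it uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters quoted in the table rows above (names are hypothetical)."""
    optimizer: str = "Adam"
    learning_rate: float = 1e-5  # assumption: the extracted value reads as 10^-5
    loss: str = "cross-entropy"
    repetition_penalty: float = 2.5
    length_penalty: float = 1.0
    train_problems: int = 1200
    eval_problems: int = 300


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Down-weight tokens already present in the output (assumed CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits are divided and non-positive ones multiplied, so a
        # repeated token never becomes more likely when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def length_normalized_score(sum_log_probs, length, length_penalty):
    """Assumed beam score: total log-probability / (length ** length_penalty)."""
    return sum_log_probs / (length ** length_penalty)


setup = ExperimentSetup()
penalized = apply_repetition_penalty([2.5, -1.0, 0.5], [0, 1], setup.repetition_penalty)
score = length_normalized_score(-6.0, 3, setup.length_penalty)
```

Under these assumptions, a length penalty of 1.0 ranks candidates by average log-probability per token, and a repetition penalty of 2.5 strongly discourages repeated tokens.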