Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming
Authors: Sicheol Sung, Aditi, Dogyu Kim, Yo-Sub Han, Sang-Ki Ko
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the CodeContests dataset demonstrate that CCFG-based test cases outperform baseline methods in identifying incorrect algorithms, achieving significant gains in validity and effectiveness. Our approach provides a scalable and reliable grammar-driven framework for enhancing automated competitive programming evaluations. We evaluate the practical usefulness of CCFGs through experimental validation. |
| Researcher Affiliation | Academia | Sicheol Sung¹, Aditi², Dogyu Kim³, Yo-Sub Han¹ and Sang-Ki Ko²; ¹Yonsei University, ²University of Seoul, ³Kangwon National University |
| Pseudocode | No | The paper provides formal grammar definitions (Example 2 and Example 3) but does not include structured pseudocode or algorithm blocks for a procedural method or process. |
| Open Source Code | Yes | All implementations and associated codes and datasets used in these experiments are available in our GitHub repository (footnote 2: https://github.com/Aditi1612/Grammar-based-test-case-generation). |
| Open Datasets | Yes | We use the CodeContests dataset, which consists of various programming problems sourced from different competitive platforms [Li et al., 2022]. |
| Dataset Splits | Yes | After categorizing the grammars, we split them into a training dataset with 1,200 problems and an evaluation dataset with 300 problems. |
| Hardware Specification | No | The paper describes experiments and model training but does not provide specific details about the hardware used, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using a fine-tuned CodeT5 model and the Adam optimizer but does not specify versions for any programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | We use Adam optimizer with learning rate $10^{-5}$ and cross-entropy loss function to train each CodeT5 model. We generate candidate grammars and constraints with repetition penalty 2.5 and length penalty 1.0 from each model. |
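
For readers trying to reproduce the setup, the values quoted in the table above can be gathered into one configuration object, sketched below. This is a minimal illustration and not the authors' code: the class name `ReportedSetup`, its field names, the helper `split_problems`, and the seed are all hypothetical. The learning rate is read as 10^-5, which is an assumption about the reported exponent. The paper splits the problems after categorizing their grammars, so the plain seeded shuffle here only stands in for that step.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    optimizer: str = "Adam"
    learning_rate: float = 1e-5  # read as 10^-5; an assumption about the reported exponent
    loss: str = "cross-entropy"
    repetition_penalty: float = 2.5
    length_penalty: float = 1.0
    n_train: int = 1200
    n_eval: int = 300


def split_problems(problem_ids, setup, seed=0):
    """Shuffle with a fixed seed and cut into (train, eval) lists.

    The seeded shuffle is an assumption for illustration; the paper's
    split follows a grammar categorization that is not reproduced here.
    """
    ids = list(problem_ids)
    if len(ids) != setup.n_train + setup.n_eval:
        raise ValueError("expected exactly n_train + n_eval problems")
    random.Random(seed).shuffle(ids)
    return ids[:setup.n_train], ids[setup.n_train:]


setup = ReportedSetup()
# Placeholder integer IDs stand in for the 1,500 problems.
train_ids, eval_ids = split_problems(range(1500), setup)
```

Keeping the reported numbers in a frozen dataclass makes it harder to change a hyperparameter silently between runs. The hardware and library versions, which the table marks as unreported, would still need to be recorded separately.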