Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Solving Probability Problems in Natural Language
Authors: Anton Dries, Angelika Kimmig, Jesse Davis, Vaishak Belle, Luc de Raedt
IJCAI 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a dataset of 2160 probability problems, our solver is able to correctly answer 97.5% of the questions given a correct model. On the end-to-end evaluation, we are able to answer 12.5% of the questions (or 31.1% if we exclude examples not supported by design). |
| Researcher Affiliation | Academia | Department of Computer Science, KU Leuven, Belgium; University of Edinburgh, UK |
| Pseudocode | No | The paper describes its methods in prose but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'An online version of our system is available at https://dtai.cs.kuleuven.be/problog/natural_language.', which refers to an online system or demo, not explicitly to the source code for the methodology described in the paper. |
| Open Datasets | No | The paper states: 'we hired three students to collect and label probability problems from textbooks and online sources. This has resulted in 2376 probability-related problem descriptions. For 2160 (90.9%) of these examples, we could derive a formal model'. While a dataset was created and used, no concrete access information (link, DOI, citation for public release) is provided for this dataset. |
| Dataset Splits | No | The paper mentions 'trained on 200 randomly selected examples' for the NLP classifier, but it does not provide specific training/test/validation splits (e.g., percentages or counts) for the main dataset of 2160 probability problems used in the overall evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software such as 'ProbLog', 'Stanford CoreNLP', and 'scikit-learn MLPClassifier', but it does not specify version numbers for any of these dependencies. |
| Experiment Setup | Yes | Our solver could solve 2106 correctly within a time limit of 60 seconds per task. [...] This classification is based on a neural-network classifier (using scikit-learn's MLPClassifier) trained on 200 randomly selected examples. As features, we use 45 features that describe the structure of the parse tree around the number (see Table 1 for a summary of these features). |
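
As a quick consistency check on the figures quoted in the table, the following minimal Python sketch recomputes the two percentages from the raw counts. The variable and function names are illustrative assumptions and do not come from the paper.

```python
# Counts quoted in the table above (names are illustrative, not from the paper).
collected = 2376   # probability-related problem descriptions collected
modeled = 2160     # problems for which a formal model could be derived
solved = 2106      # problems solved correctly within the 60-second limit


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


model_coverage = pct(modeled, collected)   # share of collected problems with a model
solver_accuracy = pct(solved, modeled)     # share of modeled problems solved

print(model_coverage, solver_accuracy)
```

Both values match the percentages quoted in the Open Datasets and Research Type rows (90.9% and 97.5%), so the counts and percentages in the table agree with one another.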