Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing

Authors: Yilmazcan Ozyurt, Tunaberk Almaci, Stefan Feuerriegel, Mrinmaya Sachan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate the effectiveness of our ExRec using various RL methods across four real-world tasks with different educational goals in online math learning. We further show that ExRec generalizes robustly to new, unseen questions and that it produces interpretable student learning trajectories.
Researcher Affiliation	Academia	Yilmazcan Ozyurt (ETH Zürich), Tunaberk Almaci (ETH Zürich), Stefan Feuerriegel (Munich Center for Machine Learning & LMU Munich), Mrinmaya Sachan (ETH Zürich)
Pseudocode	No	The paper describes methods and processes using mathematical formulations and descriptive text (e.g., Section 4 'ExRec Framework', 'KC Annotation via LLMs (Module 1)', 'Representation Learning via Contrastive Learning (Module 2)', etc.) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code and trained models are provided in https://github.com/oezyurty/ExRec.
Open Datasets	Yes	We use the XES3G5M dataset [51], a large-scale KT benchmark with high-quality math questions. It contains 7,652 unique questions and 5.5M interactions from 18,066 students. As the original questions are in Chinese, we have translated them into English. See Appendix B for details.
Dataset Splits	Yes	In evaluation, we compare RL algorithms across 2048 students, i.e., environments, from the test set of the dataset.
Hardware Specification	Yes	The training is performed on an NVIDIA A100 GPU (40GB) and completed in under 6 hours.
Software Dependencies	No	We integrate our trained KT model as an RL environment within the Tianshou library [74], following the OpenAI Gym API specification [9] to ensure seamless compatibility. This design allows multiple RL agents to interact with the KT-based environment for a comprehensive and flexible benchmarking of exercise recommendation policies. For implementation, we customize the pyKT library [48] to support our custom model architecture and KC-level supervision.
Experiment Setup	Yes	We train the model for 50 epochs using a batch size of 32, a learning rate of 5e-5, dropout of 0.1, and a temperature of 0.1 in the similarity function. The training is performed on an NVIDIA A100 GPU (40GB) and completed in under 6 hours.
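
For readers who want to record the reported setup in machine-readable form, the hyperparameters quoted in the Experiment Setup row can be captured as in the minimal sketch below. The names `TrainingConfig` and `scaled_cosine_similarity` are hypothetical and do not come from the paper or its repository. The exact form of the similarity function is not given in the quoted text, so dividing a cosine similarity by the temperature is an assumption used here only to illustrate what a temperature of 0.1 does.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameter values as quoted in the Experiment Setup row.

    The class and field names are illustrative, not the authors' code.
    """
    epochs: int = 50
    batch_size: int = 32
    learning_rate: float = 5e-5
    dropout: float = 0.1
    temperature: float = 0.1


def scaled_cosine_similarity(u, v, temperature):
    """Cosine similarity divided by a temperature.

    Assumed form: the source only states that a temperature of 0.1
    is used "in the similarity function".
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v) / temperature


config = TrainingConfig()
# Identical directions have cosine 1.0, so the scaled score is 1 / temperature.
score_same = scaled_cosine_similarity([1.0, 0.0], [1.0, 0.0], config.temperature)
# Orthogonal directions have cosine 0.0 regardless of temperature.
score_orthogonal = scaled_cosine_similarity([1.0, 0.0], [0.0, 1.0], config.temperature)
```

Under this assumed form, a smaller temperature stretches the gap between similar and dissimilar pairs, which is the usual reason for reporting it alongside the learning rate and batch size.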