Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing
Authors: Yilmazcan Ozyurt, Tunaberk Almaci, Stefan Feuerriegel, Mrinmaya Sachan
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of our ExRec using various RL methods across four real-world tasks with different educational goals in online math learning. We further show that ExRec generalizes robustly to new, unseen questions and that it produces interpretable student learning trajectories. |
| Researcher Affiliation | Academia | Yilmazcan Ozyurt ETH Zürich Tunaberk Almaci ETH Zürich Stefan Feuerriegel Munich Center for Machine Learning & LMU Munich Mrinmaya Sachan ETH Zürich |
| Pseudocode | No | The paper describes methods and processes using mathematical formulations and descriptive text (e.g., Section 4 'ExRec Framework', 'KC Annotation via LLMs (Module 1)', 'Representation Learning via Contrastive Learning (Module 2)', etc.) but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and trained models are provided in https://github.com/oezyurty/ExRec. |
| Open Datasets | Yes | We use the XES3G5M dataset [51], a large-scale KT benchmark with high-quality math questions. It contains 7,652 unique questions and 5.5M interactions from 18,066 students. As the original questions are in Chinese, we have translated them into English. See Appendix B for details. |
| Dataset Splits | Yes | In evaluation, we compare RL algorithms across 2048 students, i.e., environments, from the test set of the dataset. |
| Hardware Specification | Yes | The training is performed on an NVIDIA A100 GPU (40GB) and completed in under 6 hours. |
| Software Dependencies | No | We integrate our trained KT model as an RL environment within the Tianshou library [74], following the OpenAI Gym API specification [9] to ensure seamless compatibility. This design allows multiple RL agents to interact with the KT-based environment for a comprehensive and flexible benchmarking of exercise recommendation policies. For implementation, we customize the pyKT library [48] to support our custom model architecture and KC-level supervision. |
| Experiment Setup | Yes | We train the model for 50 epochs using a batch size of 32, a learning rate of 5e-5, dropout of 0.1, and a temperature of 0.1 in the similarity function. The training is performed on an NVIDIA A100 GPU (40GB) and completed in under 6 hours. |
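
The Software Dependencies row describes wrapping a trained KT model as an RL environment that follows the Gym API. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation: `ToyKTModel`, `KTEnv`, the mastery-update rule, and the reward (gain in mastery of a target KC) are all hypothetical stand-ins, and only the classic `reset()` / `step()` four-tuple shape is taken from the Gym convention.

```python
class ToyKTModel:
    """Hypothetical stand-in for a trained KT model (NOT the paper's model).

    The knowledge state is a list of mastery values in [0, 1], one per KC.
    """

    def __init__(self, num_kcs, learn_rate=0.3):
        self.num_kcs = num_kcs
        self.learn_rate = learn_rate

    def initial_state(self):
        # Assumption: every student starts with the same low mastery.
        return [0.2] * self.num_kcs

    def update(self, state, kc):
        # Assumption: practicing a KC moves its mastery toward 1.0.
        new_state = list(state)
        new_state[kc] += self.learn_rate * (1.0 - new_state[kc])
        return new_state


class KTEnv:
    """Environment exposing classic Gym-style reset() and step() methods.

    An action is the index of the KC whose exercise is recommended
    (a simplification: one exercise per KC).
    """

    def __init__(self, kt_model, target_kc, horizon=10):
        self.kt_model = kt_model
        self.target_kc = target_kc
        self.horizon = horizon
        self.state = None
        self.t = 0

    def reset(self):
        self.state = self.kt_model.initial_state()
        self.t = 0
        return list(self.state)

    def step(self, action):
        before = self.state[self.target_kc]
        self.state = self.kt_model.update(self.state, action)
        self.t += 1
        # Hypothetical reward: improvement in mastery of the target KC.
        reward = self.state[self.target_kc] - before
        done = self.t >= self.horizon
        return list(self.state), reward, done, {}


if __name__ == "__main__":
    env = KTEnv(ToyKTModel(num_kcs=3), target_kc=1, horizon=3)
    obs = env.reset()
    done = False
    while not done:
        # Trivial policy for illustration: always practice the target KC.
        obs, reward, done, info = env.step(1)
        print(obs, round(reward, 3), done)
```

Because the environment only exposes `reset()` and `step()`, any agent written against that interface can be swapped in without touching the KT model, which is the kind of benchmarking flexibility the quoted passage describes.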