Play to Grade: Testing Coding Games as Classifying Markov Decision Process

Authors: Allen Nie, Emma Brunskill, Chris Piech

NeurIPS 2021

Reproducibility assessment (each entry lists the variable, the result, and the supporting LLM response):
Research Type: Experimental. "We demonstrate that by designing a cooperative objective between an agent and an autoregressive model, we can use the agent to sample differential trajectories from the input MDP that allow a classifier to determine membership: Play to Grade. Our method enables an automatic feedback system for interactive code assignments. We release a dataset of 711,274 anonymized student submissions to a single assignment with hand-coded bug labels to support future research."
Researcher Affiliation: Academia. Allen Nie, Computer Science, Stanford University; Emma Brunskill, Computer Science, Stanford University; Chris Piech, Computer Science, Stanford University.
Pseudocode: Yes.

Algorithm 1: Collaborative Reinforcement Learning
TrainDifferentialPolicy(M, M′, π0, d̂θ):
    δ = 0.1
    for n = 1 … N do
        Dτ = CollectTrajectory(M, M′, π0)
        d̂θ = TrainDistFunc(L(θ), Dτ)
        δ = AdjustDelta(M, M′, π0, δ, d̂θ)
        let r̂ = 1(d̂θ(s, a) > δ)
        πn = TrainDQN(π0, M′, r̂)
    end for
    return πN, d̂θ, δ
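To make the control flow of Algorithm 1 concrete, here is a minimal, runnable sketch of the collaborative loop. All helper implementations and the 1-D "MDPs" below are toy stand-ins of my own (not the paper's actual Bounce environment, neural distance function, or DQN): trajectories are collected from both MDPs, a distance score is fit, the threshold δ is adjusted, and a new policy is produced from the indicator reward r̂ = 1(d̂θ(s, a) > δ).

```python
import random

random.seed(0)

def collect_trajectory(mdp_ref, mdp_in, policy, steps=20):
    """Roll the policy in both MDPs; tag each (state, action) with its source."""
    data = []
    for mdp, label in ((mdp_ref, 0), (mdp_in, 1)):
        s = 0.0
        for _ in range(steps):
            a = policy(s)
            s = mdp(s, a)
            data.append((s, a, label))
    return data

def train_dist_func(data):
    """Toy stand-in for TrainDistFunc: distance of a state from the
    mean state observed in the reference MDP (label 0)."""
    ref_states = [s for s, _, lbl in data if lbl == 0]
    mu = sum(ref_states) / len(ref_states)
    return lambda s, a: abs(s - mu)

def adjust_delta(data, d_hat, quantile=0.9):
    """Toy stand-in for AdjustDelta: set delta so ~90% of
    reference-MDP states score below it."""
    scores = sorted(d_hat(s, a) for s, a, lbl in data if lbl == 0)
    return scores[int(quantile * (len(scores) - 1))]

def train_policy(d_hat, delta):
    """Toy stand-in for TrainDQN under reward r = 1(d_hat(s, a) > delta):
    here just a random-exploration policy."""
    return lambda s: random.uniform(-1.0, 1.0)

def train_differential_policy(mdp_ref, mdp_in, n_iters=3):
    policy = lambda s: 0.5  # pi_0
    delta, d_hat = 0.1, None
    for _ in range(n_iters):
        data = collect_trajectory(mdp_ref, mdp_in, policy)
        d_hat = train_dist_func(data)
        delta = adjust_delta(data, d_hat)
        policy = train_policy(d_hat, delta)
    return policy, d_hat, delta

# Reference dynamics vs. a "buggy" input MDP with a constant drift.
mdp_ref = lambda s, a: 0.9 * s + 0.1 * a
mdp_bug = lambda s, a: 0.9 * s + 0.1 * a + 0.5
policy, d_hat, delta = train_differential_policy(mdp_ref, mdp_bug)
```

The loop structure mirrors the algorithm exactly; swapping the stand-ins for a learned distance classifier and a DQN recovers the method the paper describes.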
Open Source Code: Yes. "We release a Bounce dataset of student submissions with ground-truth bug labels and an OpenAI Gym-compatible environment to support further research: https://github.com/windweller/play-to-grade"
Open Datasets: Yes. "We release a Bounce dataset of student submissions with ground-truth bug labels and an OpenAI Gym-compatible environment to support further research: https://github.com/windweller/play-to-grade. Our dataset is compiled from 453,211 students who wrote a solution to the Bounce assignment. In total, there are 711,274 submissions, of which 111,773 are unique programs."
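Since the release centers on a Gym-compatible environment, the expected interaction pattern is the classic reset/step loop. The class name and dynamics below are a stub of my own invention (the repository defines the real Bounce interface); the point is only to show the rollout loop an agent would use to collect trajectories from a student program:

```python
class StubBounceEnv:
    """Minimal stand-in obeying the classic Gym reset/step API.
    The real Bounce environment lives in the play-to-grade repository."""

    def reset(self):
        self.t = 0
        return [0.0, 0.0]  # initial observation

    def step(self, action):
        self.t += 1
        obs = [float(self.t), float(action)]
        reward = 0.0
        done = self.t >= 10  # short fixed-horizon episode
        return obs, reward, done, {}

# Standard rollout loop: this is the part that carries over unchanged
# to any Gym-compatible environment.
env = StubBounceEnv()
obs, done, trajectory = env.reset(), False, []
while not done:
    action = 0  # placeholder policy
    obs, reward, done, info = env.step(action)
    trajectory.append((obs, action, reward))
```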
Dataset Splits: No. The paper describes sampling data for "evaluation" and mentions training on a small set of programs, but does not explicitly detail a separate validation split. For example, in Section 5.2, "Evaluation": "For both Body and Tail distribution, we sample 250 correct and 250 incorrect programs uniformly from each set for evaluation." and "We train our prediction on 10 incorrect programs and 1 correct program."
Hardware Specification: Yes. "Experiments are run on one Nvidia RTX 3070 GPU. The training takes less than 30 minutes and evaluation takes less than 5 hours for our sampled set of programs."
Software Dependencies: No. The paper mentions an "OpenAI Gym-compatible environment" but does not specify other key software dependencies or the version numbers required to replicate the experiments.
Experiment Setup: Yes. "We report the model specification and optimization details in appendix."