Play to Grade: Testing Coding Games as Classifying Markov Decision Process
Authors: Allen Nie, Emma Brunskill, Chris Piech
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that by designing a cooperative objective between an agent and an autoregressive model, we can use the agent to sample differential trajectories from the input MDP that allow a classifier to determine membership: Play to Grade. Our method enables an automatic feedback system for interactive code assignments. We release a dataset of 711,274 anonymized student submissions to a single assignment with hand-coded bug labels to support future research. |
| Researcher Affiliation | Academia | Allen Nie Computer Science Stanford University Emma Brunskill Computer Science Stanford University Chris Piech Computer Science Stanford University |
| Pseudocode | Yes | Algorithm 1: Collaborative Reinforcement Learning. TrainDifferentialPolicy(M, M̂, π₀, d̂_θ): δ = 0.1; for n ∈ {1, …, N} do: D_τ = CollectTrajectory(M, M̂, π₀); d̂_θ = TrainDistFunc(L(θ), D_τ); δ = AdjustDelta(M, M̂, π₀, δ, d̂_θ); let r̂ = 1(d̂_θ(s, a) > δ); π_n = TrainDQN(π₀, M̂, r̂); end for; return π_N, d̂_θ, δ |
| Open Source Code | Yes | We release a Bounce dataset of student submissions with ground truth bug labels and an OpenAI Gym compatible environment to support further research: https://github.com/windweller/play-to-grade. |
| Open Datasets | Yes | We release a Bounce dataset of student submissions with ground truth bug labels and an OpenAI Gym compatible environment to support further research: https://github.com/windweller/play-to-grade. Our dataset is compiled from 453,211 students who wrote a solution to the Bounce assignment. In total, there are 711,274 submissions, of which 111,773 are unique programs. |
| Dataset Splits | No | The paper describes sampling data for 'evaluation' and mentions training on a small set of programs, but does not explicitly detail a separate 'validation' dataset split. For example, in Section 5.2, 'Evaluation': 'For both Body and Tail distribution, we sample 250 correct and 250 incorrect programs uniformly from each set for evaluation.' and 'We train our prediction on 10 incorrect programs and 1 correct program.' |
| Hardware Specification | Yes | Experiments are run on one Nvidia RTX 3070 GPU. The training takes less than 30 minutes and evaluation takes less than 5 hours for our sampled set of programs. |
| Software Dependencies | No | The paper mentions an 'Open AIGym compatible environment' but does not specify other key software dependencies or their version numbers required to replicate the experiment. |
| Experiment Setup | Yes | We report the model specification and optimization details in the appendix. |
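The Algorithm 1 loop quoted in the Pseudocode row above can be sketched in Python. This is a toy illustration only: the MDPs, the "distance" model, and the policy/DQN stand-ins below are all simplified assumptions for readability, not the authors' implementation.

```python
import random

def collect_trajectory(ref_mdp, test_mdp, policy, steps=20):
    """Roll the policy in both MDPs; label pairs 0 (reference) or 1 (candidate)."""
    data, s_ref, s_test = [], 0.0, 0.0
    for _ in range(steps):
        a = policy(s_ref)
        s_ref = ref_mdp(s_ref, a)
        data.append(((s_ref, a), 0))
        a = policy(s_test)
        s_test = test_mdp(s_test, a)
        data.append(((s_test, a), 1))
    return data

def train_dist_func(data):
    """Toy distance model d̂_θ: distance of a state from the mean reference state."""
    ref_states = [s for (s, _a), y in data if y == 0]
    mu = sum(ref_states) / len(ref_states)
    return lambda s, a: abs(s - mu)

def adjust_delta(data, d_theta, quantile=0.95):
    """Set δ so that most reference-MDP pairs fall below the threshold."""
    ref_d = sorted(d_theta(s, a) for (s, a), y in data if y == 0)
    return ref_d[int(quantile * (len(ref_d) - 1))]

def train_policy(d_theta, delta):
    """Stand-in for DQN training under reward r̂ = 1(d̂_θ(s, a) > δ).
    A real agent would learn to seek differential states; here we just
    return a fresh random policy."""
    return lambda s: random.choice([-1.0, 1.0])

def train_differential_policy(ref_mdp, test_mdp, n_iters=3, seed=0):
    random.seed(seed)
    policy = lambda s: random.choice([-1.0, 1.0])
    delta, d_theta = 0.1, None
    for _ in range(n_iters):
        data = collect_trajectory(ref_mdp, test_mdp, policy)
        d_theta = train_dist_func(data)
        delta = adjust_delta(data, d_theta)
        policy = train_policy(d_theta, delta)
    return policy, d_theta, delta

# Toy MDPs: the "buggy" candidate drifts away from the reference dynamics.
ref = lambda s, a: 0.9 * s + 0.1 * a
buggy = lambda s, a: 0.9 * s + 0.1 * a + 0.5
policy, d_theta, delta = train_differential_policy(ref, buggy)
```

After training, states far from the reference distribution score above δ, which is the membership signal the classifier in the paper exploits.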