Play to Grade: Testing Coding Games as Classifying Markov Decision Process

Authors: Allen Nie, Emma Brunskill, Chris Piech

NeurIPS 2021

Reproducibility assessment (each entry lists the variable, the result, and the supporting LLM response):
Research Type: Experimental. "We demonstrate that by designing a cooperative objective between an agent and an autoregressive model, we can use the agent to sample differential trajectories from the input MDP that allow a classifier to determine membership: Play to Grade. Our method enables an automatic feedback system for interactive code assignments. We release a dataset of 711,274 anonymized student submissions to a single assignment with hand-coded bug labels to support future research."
Researcher Affiliation: Academia. Allen Nie, Computer Science, Stanford University; Emma Brunskill, Computer Science, Stanford University; Chris Piech, Computer Science, Stanford University.
Pseudocode: Yes.

Algorithm 1: Collaborative Reinforcement Learning
TrainDifferentialPolicy(M, M′, π0, d̂θ):
    δ = 0.1
    for n = 1 … N do
        Dτ = CollectTrajectory(M, M′, π0)
        d̂θ = TrainDistFunc(L(θ), Dτ)
        δ = AdjustDelta(M, M′, π0, δ, d̂θ)
        let r̂ = 1(d̂θ(s, a) > δ)
        πn = TrainDQN(π0, M′, r̂)
    end for
    return πN, d̂θ, δ
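To make the control flow of Algorithm 1 concrete, here is a minimal, runnable sketch of the collaborative loop. All helper implementations and the 1-D "MDPs" below are toy stand-ins of my own (not the paper's actual Bounce environment, neural distance function, or DQN): trajectories are collected from both MDPs, a distance score is fit, the threshold δ is adjusted, and a new policy is produced from the indicator reward r̂ = 1(d̂θ(s, a) > δ).

```python
import random

random.seed(0)

def collect_trajectory(mdp_ref, mdp_in, policy, steps=20):
    """Roll the policy in both MDPs; tag each (state, action) with its source."""
    data = []
    for mdp, label in ((mdp_ref, 0), (mdp_in, 1)):
        s = 0.0
        for _ in range(steps):
            a = policy(s)
            s = mdp(s, a)
            data.append((s, a, label))
    return data

def train_dist_func(data):
    """Toy stand-in for TrainDistFunc: distance of a state from the
    mean state observed in the reference MDP (label 0)."""
    ref_states = [s for s, _, lbl in data if lbl == 0]
    mu = sum(ref_states) / len(ref_states)
    return lambda s, a: abs(s - mu)

def adjust_delta(data, d_hat, quantile=0.9):
    """Toy stand-in for AdjustDelta: set delta so ~90% of
    reference-MDP states score below it."""
    scores = sorted(d_hat(s, a) for s, a, lbl in data if lbl == 0)
    return scores[int(quantile * (len(scores) - 1))]

def train_policy(d_hat, delta):
    """Toy stand-in for TrainDQN under reward r = 1(d_hat(s, a) > delta):
    here just a random-exploration policy."""
    return lambda s: random.uniform(-1.0, 1.0)

def train_differential_policy(mdp_ref, mdp_in, n_iters=3):
    policy = lambda s: 0.5  # pi_0
    delta, d_hat = 0.1, None
    for _ in range(n_iters):
        data = collect_trajectory(mdp_ref, mdp_in, policy)
        d_hat = train_dist_func(data)
        delta = adjust_delta(data, d_hat)
        policy = train_policy(d_hat, delta)
    return policy, d_hat, delta

# Reference dynamics vs. a "buggy" input MDP with a constant drift.
mdp_ref = lambda s, a: 0.9 * s + 0.1 * a
mdp_bug = lambda s, a: 0.9 * s + 0.1 * a + 0.5
policy, d_hat, delta = train_differential_policy(mdp_ref, mdp_bug)
```

The loop structure mirrors the algorithm exactly; swapping the stand-ins for a learned distance classifier and a DQN recovers the method the paper describes.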
Open Source Code: Yes. "We release a Bounce dataset of student submissions with ground-truth bug labels and an OpenAI Gym-compatible environment to support further research: https://github.com/windweller/play-to-grade"
Open Datasets: Yes. "We release a Bounce dataset of student submissions with ground-truth bug labels and an OpenAI Gym-compatible environment to support further research: https://github.com/windweller/play-to-grade. Our dataset is compiled from 453,211 students who wrote a solution to the Bounce assignment. In total, there are 711,274 submissions, of which 111,773 are unique programs."
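Since the release centers on a Gym-compatible environment, the expected interaction pattern is the classic reset/step loop. The class name and dynamics below are a stub of my own invention (the repository defines the real Bounce interface); the point is only to show the rollout loop an agent would use to collect trajectories from a student program:

```python
class StubBounceEnv:
    """Minimal stand-in obeying the classic Gym reset/step API.
    The real Bounce environment lives in the play-to-grade repository."""

    def reset(self):
        self.t = 0
        return [0.0, 0.0]  # initial observation

    def step(self, action):
        self.t += 1
        obs = [float(self.t), float(action)]
        reward = 0.0
        done = self.t >= 10  # short fixed-horizon episode
        return obs, reward, done, {}

# Standard rollout loop: this is the part that carries over unchanged
# to any Gym-compatible environment.
env = StubBounceEnv()
obs, done, trajectory = env.reset(), False, []
while not done:
    action = 0  # placeholder policy
    obs, reward, done, info = env.step(action)
    trajectory.append((obs, action, reward))
```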
Dataset Splits: No. The paper describes sampling data for "evaluation" and mentions training on a small set of programs, but does not explicitly detail a separate validation split. For example, in Section 5.2, "Evaluation": "For both Body and Tail distribution, we sample 250 correct and 250 incorrect programs uniformly from each set for evaluation." and "We train our prediction on 10 incorrect programs and 1 correct program."
Hardware Specification: Yes. "Experiments are run on one Nvidia RTX 3070 GPU. The training takes less than 30 minutes and evaluation takes less than 5 hours for our sampled set of programs."
Software Dependencies: No. The paper mentions an "OpenAI Gym-compatible environment" but does not specify other key software dependencies or the version numbers required to replicate the experiments.
Experiment Setup: Yes. "We report the model specification and optimization details in appendix."