Better Peer Grading through Bayesian Inference

Authors: Hedayat Zarkoob, Greg d'Eon, Lena Podina, Kevin Leyton-Brown

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on both synthetic and real-world data obtained by using our implemented system in four large classes. These extensive experiments show that grade aggregation using our model accurately estimates true grades, students' likelihood of submitting uninformative grades, and the variation in their inherent grading error; we also characterize our model's robustness.
Researcher Affiliation | Academia | Hedayat Zarkoob (1), Greg d'Eon (1), Lena Podina (1, 2), Kevin Leyton-Brown (1). 1: Department of Computer Science, University of British Columbia; 2: Cheriton School of Computer Science, University of Waterloo
Pseudocode | No | The provided paper text does not contain any pseudocode or clearly labeled algorithm blocks. It mentions that full details of the Gibbs updates are in Appendix A, but the Appendix content is not included in the provided text.
Open Source Code | Yes | Open-source implementations of our models are available at https://github.com/hezar1000/mta-inference-public.
Open Datasets | No | The paper uses real data gathered from its own course offerings and synthetic data generated by the authors. It mentions: “Research use of our peer grading datasets was authorized by the University of British Columbia's Behavioural Research Ethics Board (BREB #H21-03499),” but does not provide public access (link, DOI, or common dataset name with citation) to these datasets.
Dataset Splits | Yes | Instead, we used 10-fold stratified cross-validation. We first split the dataset into 10 groups of n/10 peer grades, ensuring that no two peer grades on the same submission were in the same group. Then, for each way of selecting 9 groups from the 10, we ran the model on these selected observations, summing the model's log likelihoods on the remaining group. (A sketch of this split appears after the table.)
Hardware Specification | No | The paper mentions “8 CPU hours” for running experiments but does not specify any particular CPU model, GPU model, memory, or other hardware components used for computation.
Software Dependencies | No | The paper mentions “open-source implementations of our models” and notes that “our appendix is available at https://arxiv.org/abs/2209.01242,” but it does not explicitly list any software dependencies with specific version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | Each time we fit our model to a dataset, we collected 4 runs of 1,100 Gibbs samples, discarding the first 100 burn-in samples from each run and concatenating the remainder; this took about 8 CPU hours. ... We independently optimized each model's hyperparameters using randomized search, choosing the hyperparameters that maximized the model's held-out likelihood; full details of this hyperparameter search, along with the resulting hyperparameters, are presented in Appendix D. ... We set the MIP constants to the defaults recommended in Appendix C, allowing the graders' weights to change by at most S = 0.09, with a minimum non-zero weight of T = 0.1. (A sketch of the sample-collection loop appears after the table.)
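
For the "Dataset Splits" row, a minimal Python sketch of the stated constraint (no two peer grades on the same submission in the same fold) could look like the following. The record layout (dicts with a "submission_id" key), the function name, and the random fold assignment are assumptions for illustration, not the authors' implementation.

    import random
    from collections import defaultdict

    def split_peer_grades(peer_grades, n_folds=10, seed=0):
        """Split peer-grade records into n_folds groups so that no two grades
        on the same submission land in the same fold. Assumes each submission
        has at most n_folds peer grades (true when each submission receives
        only a handful of peer grades)."""
        rng = random.Random(seed)

        # Group grade indices by the submission they refer to.
        by_submission = defaultdict(list)
        for idx, grade in enumerate(peer_grades):
            by_submission[grade["submission_id"]].append(idx)

        folds = [[] for _ in range(n_folds)]
        for indices in by_submission.values():
            # Assign this submission's grades to distinct, randomly chosen folds.
            chosen = rng.sample(range(n_folds), len(indices))
            for fold_id, idx in zip(chosen, indices):
                folds[fold_id].append(idx)
        return folds

Each of the 10 ways of holding out one fold then gives a training set of 9 folds and a held-out fold on which the fitted model's log likelihood is summed, as the quoted passage describes. Fold sizes come out approximately equal (roughly n/10) when grades are spread evenly across submissions.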
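For the "Experiment Setup" row, the sample-collection loop (4 independent chains of 1,100 Gibbs sweeps, dropping the first 100 burn-in samples of each and concatenating the rest) could be organized roughly as below. The model interface (init_state, gibbs_sweep, extract_sample) is hypothetical, standing in for the actual update equations given in the paper's Appendix A and the released repository.

    import numpy as np

    N_RUNS, N_SAMPLES, BURN_IN = 4, 1100, 100

    def run_gibbs(model, data, n_samples, seed):
        """Hypothetical driver: repeatedly apply the model's Gibbs updates and
        record one sample of the latent variables per sweep."""
        rng = np.random.default_rng(seed)
        state = model.init_state(data, rng)              # assumed interface
        samples = []
        for _ in range(n_samples):
            state = model.gibbs_sweep(state, data, rng)  # assumed interface
            samples.append(model.extract_sample(state))  # e.g. true grades, grader reliabilities
        return samples

    def collect_samples(model, data):
        """Run 4 independent chains of 1,100 sweeps each; discard the first
        100 burn-in samples of every chain and concatenate the remainder."""
        kept = []
        for run in range(N_RUNS):
            chain = run_gibbs(model, data, N_SAMPLES, seed=run)
            kept.extend(chain[BURN_IN:])
        return kept  # 4 * (1100 - 100) = 4,000 posterior samples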