Better Peer Grading through Bayesian Inference

Authors: Hedayat Zarkoob, Greg d'Eon, Lena Podina, Kevin Leyton-Brown

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on both synthetic and real-world data obtained by using our implemented system in four large classes. These extensive experiments show that grade aggregation using our model accurately estimates true grades, students' likelihood of submitting uninformative grades, and the variation in their inherent grading error; we also characterize our model's robustness.
Researcher Affiliation | Academia | Hedayat Zarkoob (1), Greg d'Eon (1), Lena Podina (1, 2), Kevin Leyton-Brown (1). 1: Department of Computer Science, University of British Columbia; 2: Cheriton School of Computer Science, University of Waterloo
Pseudocode | No | The provided paper text does not contain any pseudocode or clearly labeled algorithm blocks. It mentions that full details of the Gibbs updates are in Appendix A, but the Appendix content is not included in the provided text.
Open Source Code | Yes | Open-source implementations of our models are available at https://github.com/hezar1000/mta-inference-public.
Open Datasets | No | The paper uses real data gathered from its own course offerings and synthetic data generated by the authors. It mentions: “Research use of our peer grading datasets was authorized by the University of British Columbia's Behavioural Research Ethics Board (BREB #H21-03499),” but does not provide public access (link, DOI, or common dataset name with citation) to these datasets.
Dataset Splits | Yes | Instead, we used 10-fold stratified cross-validation. We first split the dataset into 10 groups of n/10 peer grades, ensuring that no two peer grades on the same submission were in the same group. Then, for each way of selecting 9 groups from the 10, we ran the model on these selected observations, summing the model's log likelihoods on the remaining group. (A sketch of this split appears after the table.)
Hardware Specification | No | The paper mentions “8 CPU hours” for running experiments but does not specify any particular CPU model, GPU model, memory, or other hardware components used for computation.
Software Dependencies | No | The paper mentions “open-source implementations of our models” and notes that “our appendix is available at https://arxiv.org/abs/2209.01242,” but it does not explicitly list any software dependencies with specific version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | Each time we fit our model to a dataset, we collected 4 runs of 1,100 Gibbs samples, discarding the first 100 burn-in samples from each run and concatenating the remainder; this took about 8 CPU hours. ... We independently optimized each model's hyperparameters using randomized search, choosing the hyperparameters that maximized the model's held-out likelihood; full details of this hyperparameter search, along with the resulting hyperparameters, are presented in Appendix D. ... We set the MIP constants to the defaults recommended in Appendix C, allowing the graders' weights to change by at most S = 0.09, with a minimum non-zero weight of T = 0.1. (A sketch of the sample-collection loop appears after the table.)
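
For the "Dataset Splits" row, a minimal Python sketch of the stated constraint (no two peer grades on the same submission in the same fold) could look like the following. The record layout (dicts with a "submission_id" key), the function name, and the random fold assignment are assumptions for illustration, not the authors' implementation.

    import random
    from collections import defaultdict

    def split_peer_grades(peer_grades, n_folds=10, seed=0):
        """Split peer-grade records into n_folds groups so that no two grades
        on the same submission land in the same fold. Assumes each submission
        has at most n_folds peer grades (true when each submission receives
        only a handful of peer grades)."""
        rng = random.Random(seed)

        # Group grade indices by the submission they refer to.
        by_submission = defaultdict(list)
        for idx, grade in enumerate(peer_grades):
            by_submission[grade["submission_id"]].append(idx)

        folds = [[] for _ in range(n_folds)]
        for indices in by_submission.values():
            # Assign this submission's grades to distinct, randomly chosen folds.
            chosen = rng.sample(range(n_folds), len(indices))
            for fold_id, idx in zip(chosen, indices):
                folds[fold_id].append(idx)
        return folds

Each of the 10 ways of holding out one fold then gives a training set of 9 folds and a held-out fold on which the fitted model's log likelihood is summed, as the quoted passage describes. Fold sizes come out approximately equal (roughly n/10) when grades are spread evenly across submissions.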
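For the "Experiment Setup" row, the sample-collection loop (4 independent chains of 1,100 Gibbs sweeps, dropping the first 100 burn-in samples of each and concatenating the rest) could be organized roughly as below. The model interface (init_state, gibbs_sweep, extract_sample) is hypothetical, standing in for the actual update equations given in the paper's Appendix A and the released repository.

    import numpy as np

    N_RUNS, N_SAMPLES, BURN_IN = 4, 1100, 100

    def run_gibbs(model, data, n_samples, seed):
        """Hypothetical driver: repeatedly apply the model's Gibbs updates and
        record one sample of the latent variables per sweep."""
        rng = np.random.default_rng(seed)
        state = model.init_state(data, rng)              # assumed interface
        samples = []
        for _ in range(n_samples):
            state = model.gibbs_sweep(state, data, rng)  # assumed interface
            samples.append(model.extract_sample(state))  # e.g. true grades, grader reliabilities
        return samples

    def collect_samples(model, data):
        """Run 4 independent chains of 1,100 sweeps each; discard the first
        100 burn-in samples of every chain and concatenate the remainder."""
        kept = []
        for run in range(N_RUNS):
            chain = run_gibbs(model, data, N_SAMPLES, seed=run)
            kept.extend(chain[BURN_IN:])
        return kept  # 4 * (1100 - 100) = 4,000 posterior samples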