Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Authors: Ehsan Latif, Zirak Khan, Xiaoming Zhai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate SKETCHMIND on a curated dataset of 3,575 student-generated sketches across six science assessment items with different highest order of Bloom's level that require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG (average accuracy: 55.6%), GPT-4o with SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SKETCHMIND performance; for example, SKETCHMIND with GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SKETCHMIND with GPT-4.1, which achieved an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o).
Researcher Affiliation	Academia	Ehsan Latif AI4STEM Education Center University of Georgia Athens, GA 30605 Zirak Khan School of Computing University of Georgia Athens, GA 30605 Xiaoming Zhai AI4STEM Education Center University of Georgia Athens, GA 30605 Corresponding author email: EMAIL
Pseudocode	Yes	class SRGBuilder:
    BLOOM_ORDER = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

    def __init__(self, question: str, rubric: str):
        self.question = question
        self.rubric = rubric
        self.nodes: List[Tuple[str, str]] = []  # (concept, Bloom level)
        self.edges: List[Tuple[str, str]] = []  # (from, to)

    def add_node(self, concept: str, bloom_level: str):
        assert bloom_level in self.BLOOM_ORDER, f"Invalid Bloom level: {bloom_level}"
        self.nodes.append((concept, bloom_level))

    def add_edge(self, source: str, target: str):
        self.edges.append((source, target))

    def build_graph(self) -> Dict[str, List[Tuple[str, str]]]:
        return {"nodes": self.nodes, "edges": self.edges}

    def validate_graph(self) -> bool:
        node_names = {n[0] for n in self.nodes}
        valid_edges = all(u in node_names and v in node_names for u, v in self.edges)
        # Check for connectivity
        G = nx.DiGraph()
        G.add_edges_from(self.edges)
        # The fallback value after "else" is missing from the excerpt; False is assumed.
        connected = nx.is_weakly_connected(G) if G.number_of_nodes() > 0 else False
        # Ensure Bloom level ordering is respected
        bloom_levels = [self.BLOOM_ORDER.index(level) for _, level in self.nodes]
        ordered = bloom_levels == sorted(bloom_levels)
        return valid_edges and connected and ordered

...

class Agent3:
    def run(self, reference_srg, student_srg):
        ref = SRGBuilder("", "")
        stu = SRGBuilder("", "")
        ref.nodes, ref.edges = reference_srg['nodes'], reference_srg['edges']
        stu.nodes, stu.edges = student_srg['nodes'], student_srg['edges']
        score = stu.compute_similarity(reference_srg)

        missing_nodes = [n for n in reference_srg['nodes'] if n not in student_srg['nodes']]
        missing_edges = [e for e in reference_srg['edges'] if e not in student_srg['edges']]
        irrelevant_nodes = [n for n in student_srg['nodes'] if n not in reference_srg['nodes']]
        irrelevant_edges = [e for e in student_srg['edges'] if e not in reference_srg['edges']]

        # Compare Bloom level expectations
        bloom_discrepancies = []
        ref_node_dict = dict(reference_srg['nodes'])
        for concept, level in student_srg['nodes']:
            if concept in ref_node_dict and level != ref_node_dict[concept]:
                bloom_discrepancies.append({
                    "concept": concept,
                    "expected": ref_node_dict[concept],
                    "observed": level
                })

        # Classify sketch based on similarity score
        if score >= SCORE_THRESHOLD:
            label = "Proficient"
        elif score >= 0.5:
            label = "Developing"
        else:
            label = "Beginning"

        # Rank missing nodes by Bloom level
        priority_fix = sorted(missing_nodes, key=lambda x: SRGBuilder.BLOOM_ORDER.index(x[1]))

        # Detect gaps in reasoning flow
        expected_sources = {src for src, _ in reference_srg['edges']}
        actual_sources = {src for src, _ in student_srg['edges']}
        conceptual_gaps = list(expected_sources - actual_sources)

        return {
            "similarity_score": round(score, 3),
            "classification": label,
            "missing_nodes": missing_nodes,
            "missing_edges": missing_edges,
            "irrelevant_nodes": irrelevant_nodes,
            "irrelevant_edges": irrelevant_edges,
            "bloom_discrepancies": bloom_discrepancies,
            "priority_fix": priority_fix,
            "conceptual_gaps": conceptual_gaps
        }
Open Source Code	Yes	To promote transparency and facilitate further research, we have open-sourced our codebase at our repository (https://github.com/ehsanlatif/SketchMind) and plan to make the dataset publicly available upon receiving the necessary approvals. This work represents a step forward in AI for Education, demonstrating how cognitively-aware, agentic systems can advance the quality, transparency, and effectiveness of automated reasoning over student-generated visual content.
Open Datasets	Yes	Given this gap, we base our study on a rigorously developed dataset originally introduced by Zhai et al. [30], which has since become one of the most widely recognized resources for evaluating automated reasoning over student-generated scientific models. This dataset, adapted from the NGSA (Next Generation Science Assessment) initiative [10], aligns closely with the NGSS framework [26] and has been used by AIED researchers to assess students' conceptual understanding through multimodal evidence [16, 15].
Dataset Splits	No	We conducted a structured human evaluation of model-generated feedback. Four domain-expert educators, each with graduate-level training in science education, independently assessed the pedagogical quality of system responses. Each rater evaluated a stratified random sample of 890 student-generated sketches (25% of the dataset), ensuring balanced representation across models (GPT-4o, GPT-4.1, O3), grade levels, and science task types.
Hardware Specification	Yes	Configuration 2 deploys open-source MLLMs, specifically the INT4-quantized Llama-4 Maverick and INT8-quantized Scout (400B and 109B parameters respectively, with 17B active parameters), running locally on four NVIDIA H100 GPUs, facilitated by Hugging Face Transformers (version 4.39+) and PyTorch (version 2.1+). ... Open-source models are executed on-premises, utilizing four H100 GPUs with CUDA 12.1 and cuDNN 8.9, loading quantized model weights from the Hugging Face Model Hub, and multimodal data inputs are managed by the Llama4Processor.
Software Dependencies	Yes	Configuration 2 deploys open-source MLLMs, specifically the INT4-quantized Llama-4 Maverick and INT8-quantized Scout (400B and 109B parameters respectively, with 17B active parameters), running locally on four NVIDIA H100 GPUs, facilitated by Hugging Face Transformers (version 4.39+) and PyTorch (version 2.1+). ... Open-source models are executed on-premises, utilizing four H100 GPUs with CUDA 12.1 and cuDNN 8.9, loading quantized model weights from the Hugging Face Model Hub, and multimodal data inputs are managed by the Llama4Processor.
Experiment Setup	Yes	Implementation Details. The SRG construction pipeline in SKETCHMIND utilizes a shared SRGBuilder class, which efficiently constructs, validates, and caches graphs, significantly reducing computational overhead and cost during repeated evaluations. Sketch adequacy is determined by a similarity threshold (τ = 0.75), with dynamic generation of visual hints guided by a reverse mapping (ϕ) embedded within Agent 1's implementation. We calculate the sketch prediction accuracy for each assessment item by comparing with human-expert annotated proficiencies as (Sum of correctly predicted samples across each proficiency level/Total samples) and average for all items as (Sum of all items' accuracies/Total number of items). We have evaluated the performance of SKETCHMIND by decomposing it into combinations of a target model with the proposed SRG. This decomposition can help us understand the impact of the target model and the proposed SRG and determine the best possible combination for SKETCHMIND.
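The accuracy definitions quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the record layout, the function names, and the sample data are all hypothetical, and the adequacy check assumes that the threshold τ = 0.75 is applied with a greater-than-or-equal comparison, as the quoted Agent3 pseudocode suggests.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (item_id, predicted_label, expert_label).
Record = Tuple[str, str, str]

TAU = 0.75  # similarity threshold quoted in the Experiment Setup row


def is_adequate(similarity: float, tau: float = TAU) -> bool:
    """Assumed adequacy rule: a sketch passes when similarity >= tau."""
    return similarity >= tau


def item_accuracies(records: List[Record]) -> Dict[str, float]:
    """Per-item accuracy: correctly predicted samples / total samples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item_id, predicted, expert in records:
        total[item_id] += 1
        correct[item_id] += int(predicted == expert)
    return {item: correct[item] / total[item] for item in total}


def average_accuracy(per_item: Dict[str, float]) -> float:
    """Unweighted mean: sum of item accuracies / number of items."""
    return sum(per_item.values()) / len(per_item)


# Made-up records for illustration only; these are not results from the paper.
records: List[Record] = [
    ("item_1", "Proficient", "Proficient"),
    ("item_1", "Developing", "Beginning"),
    ("item_2", "Beginning", "Beginning"),
    ("item_2", "Proficient", "Proficient"),
]

per_item = item_accuracies(records)
print(per_item)                    # {'item_1': 0.5, 'item_2': 1.0}
print(average_accuracy(per_item))  # 0.75
```

Because the average is taken over items rather than over samples, items with few sketches weigh as much as items with many; that follows the quoted definition.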