Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward

Authors: Yandong Guan, Xilin Wang, XiMing Xing, Jing Zhang, Dong Xu, Qian Yu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that CAD-Coder enables LLMs to generate diverse, valid, and complex CAD models directly from natural language, advancing the state of the art of text-to-CAD generation and geometric reasoning. ... Extensive experiments demonstrate that our method unlocks new capabilities for LLMs, enabling the generation of complex, functional CAD models directly from high-level textual intent. ... 5 Experiments ... 6 Extended Ablation Studies
Researcher Affiliation	Academia	Yandong Guan School of Software Beihang University Beijing, China EMAIL Xilin Wang School of Software Beihang University Beijing, China EMAIL Ximing Xing School of Software Beihang University Beijing, China EMAIL Jing Zhang School of Software Beihang University Beijing, China EMAIL Dong Xu The University of Hong Kong Hong Kong, China EMAIL Qian Yu School of Software Beihang University Beijing, China EMAIL
Pseudocode	Yes	CoT Reasoning Steps (Generated by CAD-Coder): 1. Description Analysis: Two parts: first rectangular prism, second triangular prism; first part has dimensions 0.375 × 0.375 × 0.375, rotated -90° about Z; second part is 0.25 × 0.25 × 0.375, rotated -90° about X and Z, translated to (0.375, 0.125, 0); both parts translate to origin initially. 2. Coordinate System Planning: First part uses identity CS; second rotated -90° X / -90° Z at (0.375, 0.125, 0). 3. Sketch Construction Strategy: First part: 4-line rectangle sketch; second part: 3-line triangular sketch; apply scaling factors (0.375 and 0.25) to coordinates. 4. Extrusion Operation Planning: First part extruded -0.375Z; second part extruded +0.375Z; combine using union operation. 5. Code Implementation Strategy: Create base sketches; apply rotations and translations; extrude and combine.
Open Source Code	No	We utilized the Hugging Face Transformers library, GRPO implementation from Verl [25], and DeepSpeed for distributed training. ... The NeurIPS checklist question 5 asks: 'Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?' Answer: [No] Justification: [No]
Open Datasets	No	We build our dataset based on the Text2CAD dataset [13], which contains 178K natural language descriptions L paired with ground-truth 3D geometries M_gt. ... In total, this pipeline produces 110K valid triplets (L, C_gt, M_gt). ... The final CoT dataset contains 1.5K high-quality CoT samples. ... The NeurIPS checklist question 5 asks: 'Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?' Answer: [No] Justification: [No]
Dataset Splits	Yes	In total, this pipeline produces 110K valid triplets (L, C_gt, M_gt). We further divide them into three subsets based on geometric quality: 8k high-quality samples with CD_gt < 1 × 10⁻⁴; 70k medium-quality samples with CD_gt < 1 × 10⁻³; and the remaining 32k hard cases with CD_gt > 1 × 10⁻³. ... For SFT, we use the 8K high-quality samples. For cold-starting, we use the 1.5K CoT-format samples. For GRPO, we use all 150K training descriptions and geometries from Text2CAD. For evaluation, we apply the same synthesis pipeline on the official Text2CAD test set to obtain corresponding triplets.
Hardware Specification	Yes	All experiments were conducted on 8 NVIDIA A800 80GB GPUs. ... Table 2: Per-sample inference latency (seconds; lower is better). GPU Model | CoT (s) | SFT (s): H800 80G | 0.06 | 0.03; A800 80G | 0.18 | 0.12; RTX 4090 24G | 0.28 | 0.16; V100 32G | 0.64 | 0.29. ... All experiments were executed on a cluster equipped with 8 NVIDIA A800 (80GB) GPUs.
Software Dependencies	Yes	We utilized the Hugging Face Transformers library, GRPO implementation from Verl [25], and DeepSpeed for distributed training. ... For efficient model inference, we employed vLLM, and used CadQuery (version 2.3.1) for CAD script execution and validation.
Experiment Setup	Yes	For the stage of SFT, we fine-tuned Qwen2.5-7B-Instruct for 3 epochs with a batch size of 64 and a learning rate of 1 × 10⁻⁵, using the AdamW optimizer [18]. Training was performed using full-parameter fine-tuning with DeepSpeed ZeRO Stage 2. For the GRPO phase, we initialized the model with SFT weights and trained for 1 epoch with a batch size of 384. To enable cold-starting of reasoning during SFT, we additionally fine-tuned the model on the 1.5K high-quality CoT-format samples for 2 epochs. The batch size was set to 384. Each input prompt generated k = 8 candidate completions. The KL divergence coefficient was set to β = 0.001.
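
The Dataset Splits row describes a three-way partition of the synthesized triplets by a geometric-quality score, CD_gt (presumably a Chamfer distance against the ground-truth geometry; the record itself does not expand the abbreviation). A minimal Python sketch of that bucketing might look like the following. The function names, the bucket labels, and the handling of a score exactly equal to a threshold are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

# Thresholds as quoted in the Dataset Splits row.
HIGH_QUALITY_MAX = 1e-4    # CD_gt below this: "high-quality" subset
MEDIUM_QUALITY_MAX = 1e-3  # CD_gt below this (but not high): "medium-quality"


def quality_bucket(cd_gt: float) -> str:
    """Assign one sample to a quality subset from its CD_gt score.

    Assumption: a score exactly equal to a threshold falls into the
    lower-quality bucket; the quoted text does not specify this case.
    """
    if cd_gt < 0:
        raise ValueError("CD_gt is a distance and cannot be negative")
    if cd_gt < HIGH_QUALITY_MAX:
        return "high"
    if cd_gt < MEDIUM_QUALITY_MAX:
        return "medium"
    return "hard"


def split_by_quality(cd_values):
    """Count how many samples land in each subset."""
    return Counter(quality_bucket(cd) for cd in cd_values)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(split_by_quality([5e-5, 5e-4, 2e-4, 5e-3]))
```

Under this reading, the 8K "high" subset would feed SFT, while the record states that GRPO draws on the full set of Text2CAD training descriptions rather than on a single bucket.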
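
The Experiment Setup row gives two GRPO hyperparameters: k = 8 completions per prompt and a KL coefficient β = 0.001. As a rough illustration of where these enter, the sketch below follows the commonly described GRPO recipe, in which each completion's reward is normalized against the mean and standard deviation of its own group and a KL term scaled by β is subtracted from the surrogate objective. This is not the authors' code or the Verl implementation; the function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev

GROUP_SIZE = 8          # k = 8 candidate completions per prompt (from the record)
KL_COEFFICIENT = 0.001  # beta (from the record)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions.

    Assumption: population standard deviation plus a small eps;
    real implementations may differ on both points.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two completions")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalized_objective(surrogate, kl_estimate, beta=KL_COEFFICIENT):
    """Subtract the beta-weighted KL estimate from the policy surrogate."""
    return surrogate - beta * kl_estimate


if __name__ == "__main__":
    # Hypothetical rewards for one group of GROUP_SIZE completions.
    rewards = [1.0, 0.0, 0.0, 1.0, 0.5, 0.5, 1.0, 0.0]
    assert len(rewards) == GROUP_SIZE
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, a prompt whose eight completions all receive the same reward contributes zero advantage in this sketch.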