Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ARMesh: Autoregressive Mesh Generation via Next-Level-of-Detail Prediction

Authors: Jiabao Lei, Kewei Shi, Zhihao Liang, Kui Jia

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments Dataset. Our 3D mesh data mainly come from Stanford 3D Scans [50], Thingi10K [63], NeRF [35], AMASS [33], and ShapeNet [2]. Metrics. Following previous works [32, 45, 51, 62], we adopt the following metrics for evaluation: Minimum Matching Distance (MMD), Coverage (COV), and 1-Nearest-Neighbor Accuracy (1-NNA). For MMD, lower values are better; for COV, higher values are better; and for 1-NNA, 50% is optimal. These metrics are calculated based on Chamfer Distance (CD). To measure visual similarity, we also use FID and KID scores to assess visual quality, with lower values being better. We also report mesh compactness as the average number of vertices and faces per mesh. Implementation Details. For simplification, we set the penalty factors for vertices, boundary edges, and faces to 0, 1, and 1, respectively, as their default values in our experiments. We follow the standard techniques in LLM, employing Byte Pair Encoding (BPE) to compress tokens. The entire vocabulary size is 16,384, resulting in a length reduction of 2-3 times. We basically follow the setup of MeshGPT [45] for our comparative experiments. We adopt a 12-layer transformer with a width of 768 for training, utilizing variable-length bf16 flash attention [11, 12] for efficient training. The average batch size is 60 per GPU. The learning rate is 10^-4 for pre-training across all categories and 10^-5 for supervised fine-tuning on specific categories (chair, table, bench, and lamp). The model is pretrained on 4 H20 GPUs for 4 days and fine-tuned on 2 H20 GPUs for 2 days. 5.1 Ablation Studies Simplification with Different Levels-of-detail. In this part, we study how the geometric reconstruction accuracy varies w.r.t. different LODs using our simplification algorithm. We test a very complex LEGO shape from [35] that contains up to 2 million faces. Visualization results, along with the numerical accuracy, are presented in Figure 7. We find that our simplification algorithm indeed produces meshes with different LODs that consistently mimic the original shape.
Researcher Affiliation	Collaboration	Jiabao Lei¹, Kewei Shi², Zhihao Liang³, Kui Jia¹. ¹The Chinese University of Hong Kong, Shenzhen; ²The University of Hong Kong; ³Tencent Hunyuan. Corresponding Author: EMAIL
Pseudocode	No	The paper describes methods and processes in narrative text and figures (e.g., Figure 5 and Figure 6 illustrate tokenization and learning process), but it does not contain explicit pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide demo code in the supplementary materials to help examine our method.
Open Datasets	Yes	Dataset. Our 3D mesh data mainly come from Stanford 3D Scans [50], Thingi10K [63], NeRF [35], AMASS [33], and ShapeNet [2].
Dataset Splits	Yes	The ShapeNet data has been preprocessed by Siddiqui et al. [45], which contains < 1,700 vertices and < 800 faces for each mesh, and we follow their settings. We basically follow the setup of MeshGPT [45] for our comparative experiments.
Hardware Specification	Yes	The model is pretrained on 4 H20 GPUs for 4 days and fine-tuned on 2 H20 GPUs for 2 days.
Software Dependencies	No	The paper mentions 'Byte Pair Encoding (BPE)', 'transformer network [52]', and 'variable-length bf16 flash attention [11, 12]', but it does not provide specific version numbers for these or other key software components like programming languages or libraries.
Experiment Setup	Yes	Implementation Details. For simplification, we set the penalty factors for vertices, boundary edges, and faces to 0, 1, and 1, respectively, as their default values in our experiments. We follow the standard techniques in LLM, employing Byte Pair Encoding (BPE) to compress tokens. The entire vocabulary size is 16,384, resulting in a length reduction of 2-3 times. We basically follow the setup of MeshGPT [45] for our comparative experiments. We adopt a 12-layer transformer with a width of 768 for training, utilizing variable-length bf16 flash attention [11, 12] for efficient training. The average batch size is 60 per GPU. The learning rate is 10^-4 for pre-training across all categories and 10^-5 for supervised fine-tuning on specific categories (chair, table, bench, and lamp). The model is pretrained on 4 H20 GPUs for 4 days and fine-tuned on 2 H20 GPUs for 2 days.
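
Illustration: The Research Type row quotes three Chamfer-Distance-based metrics (MMD, COV, 1-NNA). The sketch below shows how these metrics are commonly defined in the point-cloud generation literature. It is not the authors' evaluation code. It assumes each shape is given as an (n, 3) NumPy array of points sampled from the mesh surface and uses a squared-Euclidean variant of Chamfer Distance. The function names (chamfer_distance, pairwise_cd, mmd_cov, one_nna) are hypothetical.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (n, 3) and b (m, 3).

    Assumption: squared Euclidean distances, averaged in both directions.
    """
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (n, m)
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def pairwise_cd(xs, ys):
    """Matrix of Chamfer Distances between every shape in xs and every shape in ys."""
    return np.array([[chamfer_distance(x, y) for y in ys] for x in xs])


def mmd_cov(gen, ref):
    """Minimum Matching Distance (lower is better) and Coverage (higher is better)."""
    d = pairwise_cd(gen, ref)  # rows: generated shapes, columns: reference shapes
    # MMD: for each reference shape, distance to its closest generated shape.
    mmd = d.min(axis=0).mean()
    # COV: fraction of reference shapes that are the nearest neighbor
    # of at least one generated shape.
    cov = len(np.unique(d.argmin(axis=1))) / len(ref)
    return mmd, cov


def one_nna(gen, ref):
    """Leave-one-out 1-nearest-neighbor accuracy; 0.5 means indistinguishable sets."""
    shapes = list(gen) + list(ref)
    labels = np.array([0] * len(gen) + [1] * len(ref))
    d = pairwise_cd(shapes, shapes)
    np.fill_diagonal(d, np.inf)  # a shape may not be its own neighbor
    nearest = d.argmin(axis=1)
    return float((labels[nearest] == labels).mean())
```

The pairwise loop is quadratic in the number of shapes and is meant only to make the definitions concrete. A 1-NNA near 0.5 indicates that a nearest-neighbor classifier cannot tell generated shapes from reference shapes, which is why the quoted text calls 50% optimal.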