Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Jury-and-Judge Chain-of-Thought for Uncovering Toxic Data in 3D Visual Grounding

Authors: Kaixiang Huang, Qifeng Zhang, Jin Wang, Jingru Yang, Yang Zhou, Huan Yu, Guodong Lu, Shengfeng He

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that our framework not only achieves human-level discrimination at the scene level but also improves the performance of baseline algorithms via data purification. Code is available at https://github.com/Hermione-HKX/Refer_Judge. ... We validate the effectiveness of Refer-Judge through extensive experiments, showing both its alignment with human judgments and its ability to improve baseline 3DVG model performance when trained on filtered data.
Researcher Affiliation	Academia	1 Zhejiang University, 2 Carnegie Mellon University, 3 Singapore Management University.
Pseudocode	No	The paper describes methods and processes in text, using formulations and figures to illustrate the architecture (Figure 2, Figure 3), but it does not present any explicit pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/Hermione-HKX/Refer_Judge.
Open Datasets	Yes	We conduct experiments on the proposed ScanRefer-Justice dataset to verify the effectiveness of Refer-Judge. Built upon the widely-used ScanRefer benchmark [7], ScanRefer-Justice introduces reliable 3DVG judgments annotated and verified by human experts... Additionally, we evaluate baseline models on the ScanRefer dataset to demonstrate how identifying and removing toxic annotations improves model performance. ... The proposed dataset is constructed based on ScanRefer and adheres to its license agreement (CC BY-NC-SA 3.0).
Dataset Splits	Yes	We choose 3,001 data from the ScanRefer training and validation sets to form ScanRefer-Justice, covering 162 scenes. ... Additionally, we evaluate baseline models on the ScanRefer dataset to demonstrate how identifying and removing toxic annotations improves model performance. ... We further separate the validation set into purified and toxic subsets using Refer-Judge.
Hardware Specification	Yes	Our experiments are conducted on a computational platform equipped with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz CPU x2, 128G memory, and RTX 4090 GPU x8. The inference of LLAMA is conducted using a dual-GPU setup.
Software Dependencies	No	To ensure generality, all MLLMs are used with default configurations. Proprietary models, such as GPT-4o, are accessed through APIs. LLAMA-3.2 is deployed with the released checkpoints. The paper mentions the MLLMs used (e.g., GPT-4o, LLAMA-3.2) but does not specify ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch), or their version numbers, which are typically needed for reproducibility.
Experiment Setup	Yes	To ensure generality, all MLLMs are used with default configurations. Proprietary models, such as GPT-4o, are accessed through APIs. LLAMA-3.2 is deployed with the released checkpoints. ... Notably, all models are evaluated in a zero-shot manner without fine-tuning on ScanRefer-Justice.