Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Jury-and-Judge Chain-of-Thought for Uncovering Toxic Data in 3D Visual Grounding
Authors: Kaixiang Huang, Qifeng Zhang, Jin Wang, Jingru Yang, Yang Zhou, Huan Yu, Guodong Lu, Shengfeng He
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our framework not only achieves human-level discrimination at the scene level but also improves the performance of baseline algorithms via data purification. Code is available at https://github.com/Hermione-HKX/Refer_Judge. ... We validate the effectiveness of Refer-Judge through extensive experiments, showing both its alignment with human judgments and its ability to improve baseline 3DVG model performance when trained on filtered data. |
| Researcher Affiliation | Academia | 1 Zhejiang University, 2 Carnegie Mellon University, 3 Singapore Management University. Corresponding author: EMAIL |
| Pseudocode | No | The paper describes methods and processes in text, using formulations and figures to illustrate the architecture (Figure 2, Figure 3), but it does not present any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/Hermione-HKX/Refer_Judge. |
| Open Datasets | Yes | We conduct experiments on the proposed ScanRefer-Justice dataset to verify the effectiveness of Refer-Judge. Built upon the widely-used ScanRefer benchmark [7], ScanRefer-Justice introduces reliable 3DVG judgments annotated and verified by human experts... Additionally, we evaluate baseline models on the ScanRefer dataset to demonstrate how identifying and removing toxic annotations improves model performance. ... The proposed dataset is constructed based on ScanRefer and adheres to its license agreement (CC BY-NC-SA 3.0). |
| Dataset Splits | Yes | We choose 3,001 data from the ScanRefer training and validation sets to form ScanRefer-Justice, covering 162 scenes. ... Additionally, we evaluate baseline models on the ScanRefer dataset to demonstrate how identifying and removing toxic annotations improves model performance. ... We further separate the validation set into purified and toxic subsets using Refer-Judge. |
| Hardware Specification | Yes | Our experiments are conducted on a computational platform equipped with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50 GHz CPU x2, 128G memory, and RTX 4090 GPU x8. The inference of LLAMA is conducted using a dual-GPU setup. |
| Software Dependencies | No | To ensure generality, all MLLMs are used with default configurations. Proprietary models, such as GPT-4o, are accessed through APIs. LLAMA-3.2 is deployed with the released checkpoints. The paper mentions the MLLMs used (e.g., GPT-4o, LLAMA-3.2) but does not specify ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch), or their version numbers, which are typically needed for reproducibility. |
| Experiment Setup | Yes | To ensure generality, all MLLMs are used with default configurations. Proprietary models, such as GPT-4o, are accessed through APIs. LLAMA-3.2 is deployed with the released checkpoints. ... Notably, all models are evaluated in a zero-shot manner without fine-tuning on ScanRefer-Justice. |