Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts

Authors: Xuming He, Zhiyuan You, Junchao Gong, Couhua Liu, Xiaoyu Yue, Peiqin Zhuang, Wenlong Zhang, Lei Bai

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments are conducted to evaluate the effectiveness of RadarQA. First, with the support of RQA-70K, RadarQA outperforms open-source MLLMs by a large margin (e.g., 66.17% v.s. 36.70% in overall sequence rating). Second, our RadarQA can generate a detailed and comprehensive assessment report, as shown in Fig. 2, even surpassing the powerful OpenAI o1 [29] (6.58 v.s. 5.49 in GPT-4 Score for sequence assessment). These results demonstrate the superiority of RadarQA and highlight the research potential of multi-modal weather forecast analysis tasks.
Researcher Affiliation	Collaboration	Xuming He1,2, Zhiyuan You3, Junchao Gong1, Couhua Liu4, Xiaoyu Yue1, Peiqin Zhuang1, Wenlong Zhang1, Lei Bai1; 1 Shanghai Artificial Intelligence Laboratory; 2 Zhejiang University; 3 The Chinese University of Hong Kong; 4 Center for Earth System Modeling and Prediction of China Meteorological Administration
Pseudocode	No	The paper describes the model training pipeline and task paradigm but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The code and dataset are publicly available at https://github.com/hexmSeeU/RadarQA.
Open Datasets	Yes	To support training and benchmarking, we design a hybrid annotation pipeline that combines human expert labeling with automated heuristics. With such an annotation method, we construct RQA-70K, a large-scale dataset with varying difficulty levels for radar forecast quality evaluation. We further design a multi-stage training strategy that iteratively improves model performance at each stage. Extensive experiments show that RadarQA outperforms existing general MLLMs across all evaluation settings, highlighting its potential for advancing quality analysis in weather prediction. The code and dataset are publicly available at https://github.com/hexmSeeU/RadarQA.
Dataset Splits	Yes	The statistics of our dataset are summarized in Tab. 1. Our dataset consists of 40,000 brief templated samples (training set of rating tasks), along with 29,000 detailed, high-quality samples (training set of assessment tasks). To ensure the reliability of these samples, all annotations undergo expert validation, and automated annotations are routinely verified through expert spot-checking on sampled batches to ensure accuracy.
Hardware Specification	Yes	The entire training process takes approximately 50 hours using 8 NVIDIA A800 GPUs.
Software Dependencies	No	We adopt Qwen-2.5-VL-7B [3] as the base model. In Stage 1, we employ AdamW as the optimizer, with an initial learning rate of 1 × 10⁻⁴. We integrate LoRA with a rank of 8. The model is trained with a total batch size of 128 for 5 epochs on RQA-70K. In Stage 2, we set the generation number of GRPO to 4, and train the model for 1 epoch on 10,000 randomly selected brief task samples with a total batch size of 32. In Stage 3, we set the LoRA rank to 4 and fine-tune the model for 1 epoch using 2,500 samples from each sub-task. The paper mentions specific models like Qwen-2.5-VL-7B but does not provide version numbers for general software dependencies like PyTorch, TensorFlow, or AdamW.
Experiment Setup	Yes	In Stage 1, we employ AdamW as the optimizer, with an initial learning rate of 1 × 10⁻⁴. We integrate LoRA with a rank of 8. The model is trained with a total batch size of 128 for 5 epochs on RQA-70K. In Stage 2, we set the generation number of GRPO to 4, and train the model for 1 epoch on 10,000 randomly selected brief task samples with a total batch size of 32. In Stage 3, we set the LoRA rank to 4 and fine-tune the model for 1 epoch using 2,500 samples from each sub-task.
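
For readers who want the reported setup in machine-readable form, the three training stages quoted in the table can be written down as a small configuration. This is only a sketch under stated assumptions, not the authors' code: `StageConfig`, `STAGES`, `steps_per_epoch`, and `gpu_hours` are hypothetical names introduced here, the learning rate is read as 1 × 10⁻⁴, and any value the quoted text does not report is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """One training stage; None means the quoted text does not report the value."""
    name: str
    epochs: int
    optimizer: Optional[str] = None
    learning_rate: Optional[float] = None
    lora_rank: Optional[int] = None
    total_batch_size: Optional[int] = None
    grpo_generations: Optional[int] = None
    num_samples: Optional[int] = None
    samples_per_subtask: Optional[int] = None


BASE_MODEL = "Qwen-2.5-VL-7B"  # base model named in the table

STAGES = [
    # Stage 1: AdamW, learning rate read as 1e-4 (assumption), LoRA rank 8,
    # total batch size 128, 5 epochs on RQA-70K.
    StageConfig("stage1", epochs=5, optimizer="AdamW", learning_rate=1e-4,
                lora_rank=8, total_batch_size=128),
    # Stage 2: GRPO with 4 generations, 1 epoch on 10,000 brief task samples,
    # total batch size 32.
    StageConfig("stage2", epochs=1, grpo_generations=4,
                total_batch_size=32, num_samples=10_000),
    # Stage 3: LoRA rank 4, 1 epoch on 2,500 samples from each sub-task.
    StageConfig("stage3", epochs=1, lora_rank=4, samples_per_subtask=2_500),
]


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, assuming the last partial batch is kept."""
    return ceil(num_samples / batch_size)


def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU-hours for a run that uses all GPUs for the whole duration."""
    return wall_clock_hours * num_gpus


if __name__ == "__main__":
    stage2 = STAGES[1]
    print(steps_per_epoch(stage2.num_samples, stage2.total_batch_size))
    print(gpu_hours(50, 8))  # roughly 50 hours on 8 A800 GPUs, as reported
```

Under these assumptions, Stage 2 works out to 313 optimizer steps (10,000 samples at a batch size of 32, keeping the final partial batch), and the reported hardware budget is about 400 GPU-hours. Whether the original training drops the last batch or uses gradient accumulation is not stated in the quoted text.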