Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data

Authors: Haolong Qian, Xianliang Yang, Ling Zhang, Lei Song, Jiang Bian, Chun Yuan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate NaDRO's ability to handle noisy data in complex, long-horizon tasks, we model classic combinatorial optimization problems, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP), as Markov Decision Processes. ... Our extensive experiments robustly show that small-sized models like Qwen 7B [15] and Llama 3.1-8B [16], when fine-tuned with NaDRO, significantly surpass the performance of leading LLMs, including GPT-4o and DeepSeek R1, on these demanding decision-making tasks.
Researcher Affiliation | Collaboration | (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Microsoft Research Asia, Microsoft
Pseudocode | Yes | Algorithm 1: NaDRO Offline Training Algorithm
Open Source Code | Yes | Code is released at https://github.com/microsoft/HeurAgenix/tree/NaDRO.
Open Datasets | Yes | TSP: We randomly selected a diverse set of 10 instances from the well-known TSPLIB benchmark library to evaluate our method. ... CVRP: We selected instances 1 through 10 from the Golden dataset for CVRP.
Dataset Splits | Yes | TSP: We randomly selected a diverse set of 10 instances from the well-known TSPLIB benchmark library to evaluate our method. CVRP: We selected instances 1 through 10 from the Golden dataset for CVRP. Training data in the form of $(s_t, \{a_i, Q(s_t, a_i)\}_{i=1}^{N_A})$ pairs were generated using MCTS.
Hardware Specification | Yes | Experiments were performed on a cluster with NVIDIA A100 and NVIDIA A6000 GPUs, leveraging the Unsloth framework [33] for optimized training efficiency.
Software Dependencies | No | These settings were largely managed via the GRPOConfig class from the TRL (Transformer Reinforcement Learning) library, leveraging Unsloth for efficient training.
Experiment Setup | Yes | This section outlines the key hyperparameters and configuration settings employed during the finetuning of the Qwen2.5-7B-Instruct model using our NaDRO framework with the Group Relative Policy Optimization (GRPO) method. ... The primary parameters are detailed in Table 5.
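
The Experiment Setup entry above names Group Relative Policy Optimization (GRPO) as the fine-tuning method. As a rough illustration of the group-relative idea only, the sketch below normalizes the rewards of several completions sampled for the same prompt against their group mean and standard deviation. It is not the authors' code and does not reproduce NaDRO's dual-reward design; the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration: each advantage is the reward's
    distance from the group mean, scaled by the group's (population)
    standard deviation. `eps` avoids division by zero when all rewards match.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up rewards for four sampled completions of a single prompt.
    rewards = [0.2, 0.9, 0.5, 0.4]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.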