Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Fairshare Data Pricing via Data Valuation for Large Language Models

Authors: Luyang Zhang, Cathy Jiao, Beibei Li, Chenyan Xiong

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our theoretical and empirical findings show that fairshare pricing offers clear advantages compared to existing methods. First, we show that existing exploitative pricing leads to a lose-lose outcome for the data market. Second, we empirically validate our approach through simulations of buyer-seller interactions in data markets. We focus on training open-source LLMs on complex NLP tasks, including math problems [27], medical diagnosis [28], and physical reasoning [29]. Analyzing both pricing and valuation outcomes, we find that under fairshare pricing, buyers achieve higher model performance per dollar spent, making it particularly beneficial for those with limited budgets. In addition, our simulations of long-term market dynamics demonstrate that fairshare pricing encourages sustained seller participation, resulting in a stable and sufficient supply of training data over time compared to exploitative pricing. These findings show that our framework's data-valuation-based pricing not only improves short-term training efficiency, but also ensures the long-term viability of the data market. |
| Researcher Affiliation | Academia | Luyang Zhang*, Carnegie Mellon University, EMAIL; Cathy Jiao*, Carnegie Mellon University, EMAIL; Beibei Li, Carnegie Mellon University, EMAIL; Chenyan Xiong, Carnegie Mellon University, EMAIL |
| Pseudocode | Yes | Algorithm 1: Determine if buyer Bk will purchase dataset Dj at price pj |
| Open Source Code | Yes | Our code/data will be openly available on GitHub |
| Open Datasets | Yes | We focus on challenging, human-annotated tasks: MathQA and GSM8K [27, 84] for math, MedQA [28] for medical diagnosis, and PIQA [29] for physical reasoning [85-87]. Table 1 in Appendix F shows dataset splits and examples. |
| Dataset Splits | Yes | Table 1 in Appendix F shows dataset splits and examples. MathQA: 29837/4475/2985; GSM8K: 7473/1319; MedQA: 10178/1272/1273; PIQA: 16000/2000. |
| Hardware Specification | Yes | All models are trained on A6000 GPUs on single GPU settings and take less than 1 hour. |
| Software Dependencies | No | No specific software versions are explicitly mentioned in the paper. The paper mentions "LoRA [91]", which is a technique, not a software dependency with a specific version. |
| Experiment Setup | Yes | We train each model (i.e., buyer) on these samples separately using LoRA [91] for 3 epochs, with a learning rate of 2e-5 and batch size 32. |
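
The reported split sizes and training hyperparameters can be gathered into a small configuration for anyone attempting a rerun. The following is a minimal sketch, not the authors' code: the names `REPORTED_SPLITS`, `FinetuneConfig`, `optimizer_steps`, and `split_fractions` are hypothetical. The step count assumes one optimizer step per batch with the final partial batch kept, which the source does not state. `num_samples` stands for however many samples a buyer trains on, which the source also leaves unspecified.

```python
import math
from dataclasses import dataclass

# Split sizes as reported in the Dataset Splits row. The source does not
# label the partitions, so they are kept as ordered tuples.
REPORTED_SPLITS = {
    "MathQA": (29837, 4475, 2985),
    "GSM8K": (7473, 1319),
    "MedQA": (10178, 1272, 1273),
    "PIQA": (16000, 2000),
}


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters quoted in the Experiment Setup row."""

    method: str = "LoRA"
    epochs: int = 3
    learning_rate: float = 2e-5
    batch_size: int = 32

    def optimizer_steps(self, num_samples: int) -> int:
        # Assumption: one step per batch, final partial batch kept.
        return self.epochs * math.ceil(num_samples / self.batch_size)


def split_fractions(sizes):
    """Return each partition's share of the total, rounded to 3 decimals."""
    total = sum(sizes)
    return tuple(round(size / total, 3) for size in sizes)


if __name__ == "__main__":
    cfg = FinetuneConfig()
    for name, sizes in REPORTED_SPLITS.items():
        print(name, sum(sizes), split_fractions(sizes))
```

For example, `FinetuneConfig().optimizer_steps(n)` gives the total step count for a hypothetical buyer training on `n` purchased samples. Keeping the hyperparameters in a frozen dataclass makes it harder to change a reported value by accident between runs.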