Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fairshare Data Pricing via Data Valuation for Large Language Models
Authors: Luyang Zhang, Cathy Jiao, Beibei Li, Chenyan Xiong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical and empirical findings show that fairshare pricing offers clear advantages compared to existing methods. First, we show that existing exploitative pricing leads to a lose-lose outcome for the data market. Second, we empirically validate our approach through simulations of buyer-seller interactions in data markets. We focus on training open-source LLMs on complex NLP tasks, including math problems [27], medical diagnosis [28], and physical reasoning [29]. Analyzing both pricing and valuation outcomes, we find that under fairshare pricing, buyers achieve higher model performance per dollar spent, making it particularly beneficial for those with limited budgets. In addition, our simulations of long-term market dynamics demonstrate that fairshare pricing encourages sustained seller participation, resulting in a stable and sufficient supply of training data over time compared to exploitative pricing. These findings show that our framework's data-valuation-based pricing not only improves short-term training efficiency, but also ensures the long-term viability of the data market. |
| Researcher Affiliation | Academia | Luyang Zhang*, Carnegie Mellon University; Cathy Jiao*, Carnegie Mellon University; Beibei Li, Carnegie Mellon University; Chenyan Xiong, Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Determine if buyer B_k will purchase dataset D_j at price p_j |
| Open Source Code | Yes | Our code/data will be openly available on GitHub |
| Open Datasets | Yes | We focus on challenging, human-annotated tasks: MathQA and GSM8K [27, 84] for math, MedQA [28] for medical diagnosis, and PIQA [29] for physical reasoning [85-87]. Table 1 in Appendix F shows dataset splits and examples. |
| Dataset Splits | Yes | Table 1 in Appendix F shows dataset splits and examples. MathQA: 29837/4475/2985; GSM8K: 7473/1319; MedQA: 10178/1272/1273; PIQA: 16000/2000 |
| Hardware Specification | Yes | All models are trained on A6000 GPUs on single GPU settings and take less than 1 hour. |
| Software Dependencies | No | No specific software versions are explicitly mentioned in the paper. The paper mentions "LoRA [91]" which is a technique, not a software dependency with a specific version. |
| Experiment Setup | Yes | We train each model (i.e., buyer) on these samples separately using LoRA [91] for 3 epochs, with a learning rate of 2e-5 and batch size 32. |
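
To put the reported setup in concrete terms, the sketch below gathers the hyperparameters quoted in the Experiment Setup row (3 epochs, learning rate 2e-5, batch size 32) and estimates how many optimizer steps a run would take. It rests on several assumptions that the table does not confirm: that the first number in each reported split is the training set, that a buyer trains on the full training split (the paper trains on purchased samples, so these figures are at most an upper bound), and that no gradient accumulation is used. The names `TrainConfig`, `TRAIN_SIZES`, and `optimizer_steps` are hypothetical and do not come from the authors' code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row."""
    epochs: int = 3
    learning_rate: float = 2e-5
    batch_size: int = 32


# Assumption: the first number of each reported split is the training set size.
TRAIN_SIZES = {
    "MathQA": 29837,
    "GSM8K": 7473,
    "MedQA": 10178,
    "PIQA": 16000,
}


def optimizer_steps(num_samples: int, cfg: TrainConfig) -> int:
    """Steps for a full run, assuming one optimizer step per batch
    and a final partial batch that is kept rather than dropped."""
    steps_per_epoch = math.ceil(num_samples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


if __name__ == "__main__":
    cfg = TrainConfig()
    for name, size in TRAIN_SIZES.items():
        print(f"{name}: {optimizer_steps(size, cfg)} steps")
```

Under these assumptions even the largest dataset needs fewer than 3,000 steps, which is in line with the reported training time of under one hour on a single A6000 GPU, although the table gives no step counts to compare against.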