Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Matching Markets Meet LLMs: Algorithmic Reasoning with Ranked Preferences

Authors: Hadi Hosseini, Samarth Khanna, Ronak Singh

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate seven state-of-the-art models on a hierarchy of preference-based reasoning tasks ranging from stable-matching generation to instability detection, instability resolution, and fine-grained preference queries to systematically expose their logical and algorithmic limitations in handling ranked inputs. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets, but fails to bring about a similar improvement in large instances, suggesting the need for more sophisticated strategies to improve LLMs' reasoning with larger-context inputs.
Researcher Affiliation	Academia	Hadi Hosseini (Penn State University, USA); Samarth Khanna (Penn State University, USA); Ronak Singh (Penn State University, USA)
Pseudocode	Yes	Algorithm 1: The Deferred Acceptance Algorithm. assign each agent m ∈ M and w ∈ W to be free; while there exists a free man m who has not proposed to every woman do; w ← highest-ranked woman on m's preference list to whom he has not yet proposed; m proposes to w; if w is free then w tentatively accepts m; else if w prefers m to her current partner m′ then w rejects m′ and tentatively accepts m, and m′ becomes free; else w rejects m; end if; end while; return the set of engaged pairs; these form a stable matching
Open Source Code	Yes	Data and Code: github.com/SamarthKhanna/LLM_Matching_Markets
Open Datasets	Yes	Data and Code: github.com/SamarthKhanna/LLM_Matching_Markets. We synthetically sample a set of 300 preference profiles, partitioned into three sets of 100 instances for each difficulty level, namely Easy (n = 10 agents on each side of the market), Medium (n = 20), and Hard (n = 50).
Dataset Splits	Yes	We synthetically sample a set of 300 preference profiles, partitioned into three sets of 100 instances for each difficulty level, namely Easy (n = 10 agents on each side of the market), Medium (n = 20), and Hard (n = 50). The preference profiles are sampled from two types of distributions, Impartial Culture (IC) and Master-list (ML), each constituting 50 questions at each difficulty level.
Hardware Specification	Yes	Each model was fine-tuned using a single NVIDIA H100 GPU (80GB RAM) with CUDA support; model and inputs were explicitly transferred to GPU for inference and training. We used a single GPU for inference involving DeepSeek-8B and DeepSeek-14B, two GPUs for inference involving Qwen-QwQ-32B, and four GPUs for inference involving Llama-3.3-70B and DeepSeek-70B.
Software Dependencies	No	We used the Unsloth framework with parameter-efficient tuning (LoRA). Fine-tuning was conducted using the SFTTrainer from the TRL library. (No specific version numbers for Unsloth or the TRL library are provided in the text.)
Experiment Setup	Yes	Training Configuration. Fine-tuning was conducted using the SFTTrainer from the TRL library with the following training arguments: Batch size per device: 2 (1 for Qwen-QwQ-32B); Gradient accumulation steps: 4 (2 for Qwen-QwQ-32B); Learning rate: 2 × 10^-4 with a linear scheduler and 5 warmup steps; Optimizer: AdamW-8bit; Weight decay: 0.01; Precision: mixed precision (FP16 or BF16, based on hardware support)
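
The Pseudocode row above quotes the men-proposing Deferred Acceptance algorithm, and the Dataset Splits row mentions Impartial Culture (IC) preference profiles. As a rough illustration only, and not the authors' code from their repository, the following Python sketch shows one way the algorithm could be implemented, together with a blocking-pair check of the kind an instability-detection task relies on. The function names (deferred_acceptance, blocking_pairs, sample_ic_profile), the agent labels, and the seed are all hypothetical. The IC sampler assumes the standard definition in which every agent's ranking is an independent, uniformly random permutation; the Master-list distribution is not sketched because its parameters are not given here.

```python
import random


def deferred_acceptance(men_prefs, women_prefs):
    """Men-proposing deferred acceptance.

    Each prefs argument maps an agent to a list of agents on the other
    side, most preferred first. Returns a dict mapping man -> woman.
    """
    # rank[w][m] is the position of m on w's list (lower is better).
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # next index m will propose to
    partner = {}                             # woman -> tentatively accepted man
    free = list(men_prefs)
    while free:
        m = free.pop()
        if next_choice[m] >= len(men_prefs[m]):
            continue  # m has proposed to every woman and stays unmatched
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = partner.get(w)
        if current is None:
            partner[w] = m                   # w is free: tentatively accept
        elif rank[w][m] < rank[w][current]:
            partner[w] = m                   # w trades up and rejects current
            free.append(current)
        else:
            free.append(m)                   # w rejects m
    return {m: w for w, m in partner.items()}


def blocking_pairs(matching, men_prefs, women_prefs):
    """Return all (m, w) pairs that prefer each other to their partners."""
    husband = {w: m for m, w in matching.items()}
    pairs = []
    for m, prefs in men_prefs.items():
        for w in prefs:
            if matching.get(m) == w:
                break  # every later woman is ranked below m's partner
            m_cur = husband.get(w)
            if m_cur is None or women_prefs[w].index(m) < women_prefs[w].index(m_cur):
                pairs.append((m, w))
    return pairs


def sample_ic_profile(n, seed=0):
    """Impartial Culture: each ranking is a uniformly random permutation."""
    rng = random.Random(seed)
    men = [f"M{i}" for i in range(n)]
    women = [f"W{i}" for i in range(n)]
    men_prefs = {m: rng.sample(women, n) for m in men}
    women_prefs = {w: rng.sample(men, n) for w in women}
    return men_prefs, women_prefs


# Usage: an "Easy"-sized market with n = 10 agents on each side.
men_prefs, women_prefs = sample_ic_profile(10)
matching = deferred_acceptance(men_prefs, women_prefs)
print(matching)
print(blocking_pairs(matching, men_prefs, women_prefs))
```

With complete preference lists and equally sized sides, every agent ends up matched and the blocking-pair list for the returned matching is empty, which is the stability property the quoted pseudocode states in its final line.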