Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning

Authors: Yibo Zhao, Yang Zhao, Hongru Du, Hao Frank Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions.
Researcher Affiliation	Academia	Yibo Zhao, Department of Civil and Systems Engineering, Johns Hopkins University; Yang Zhao, Department of Civil and Systems Engineering, Johns Hopkins University; Hongru Du, Department of Systems and Information Engineering, University of Virginia; Hao Frank Yang, Department of Civil and Systems Engineering, Johns Hopkins Data Science and AI Institute, Johns Hopkins University
Pseudocode	Yes	Algorithm 1 ATHENA Optimization Flow Require: Demographic group g, dataset D_g, domain concept C, symbolic building block S 1: Initialize B^0 ← None // Stage 1: Group-Level Symbolic Utility Discovery 2: for t = 1 to T do 3: Sample symbolic utility functions {f^t_{g,k}}_{k=1}^K ~ LLM(· | g, C, S, B^{t-1}) 4: Update B^t ← {f^t_{g,+}, f^t_{g,−}} using Eq. (4) 5: Select best function f*_g ← arg min_{f ∈ F_g} L_g(f, D_g) 6: if stopping condition in Eq. (5) is met then 7: break 8: end if 9: end for // Stage 2: Individual-Level Semantic Adaptation 10: for each individual i ∈ g do 11: Initialize semantic template P^0_i ~ LLM(· | f*_g, i, C) 12: for t = 1 to T do 13: Update P^{t+1}_i ← P^t_i − η ∇L_i(P^t_i, D_i) using Eq. (7) 14: end for 15: end for 16: return {P*_i}_{i ∈ g}, predict decisions using Eq. (8).
Open Source Code	Yes	The project page can be found at https://yibozh.github.io/Athena. Yes, the code and data will be published on GitHub upon acceptance.
Open Datasets	Yes	(1) Swissmetro Transportation Choice (Swissmetro): This dataset is a widely used benchmark in travel mode choice modeling [97–101]. Each record details a trip between major Swiss cities and includes both traveler characteristics (e.g., income, age) and alternative-specific attributes (e.g., travel time, cost). The dataset has a potential choice set of three transportation modes: Train, Car, and Metro. (2) COVID-19 Vaccination Choice (Vaccine): This dataset is derived from a large-scale international survey, conducted across multiple countries [102].
Dataset Splits	Yes	To maintain a reasonable budget for the template-adaptation stage, we restricted the experimental sample to a representative subset of each dataset. Specifically, we used: (1) Swissmetro: 500 travelers, two trip records per person; (2) Vaccine: 300 respondents, one survey record per person. Within each dataset, we first identified key demographic dimensions (gender, age, and income), then sampled approximately balanced subsets across these strata from the full dataset. This ensures (i) comparable class priors between training and test splits, and (ii) that no demographic group dominates the symbolic-utility discovery process. Table 9: Utility-based models and key settings (train : test = 0.8 : 0.2) Table 11: Machine learning models and key settings (train : test = 0.8 : 0.2)
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It mentions LLM backbones used (e.g., gpt-4o-mini) but not the physical hardware.
Software Dependencies	Yes	Both stages of ATHENA, symbolic-utility discovery and individual semantic adaptation, run on the gpt-4o-mini-2024-07-18 and gemini-2.0-flash.
Experiment Setup	Yes	4 Experiments. Experiment Configurations. To maintain a reasonable budget for the template-adaptation stage, we restricted the experimental sample to a representative subset of each dataset. Specifically, we used: (1) Swissmetro: 500 travelers, two trip records per person; (2) Vaccine: 300 respondents, one survey record per person. Within each dataset, we first identified key demographic dimensions (gender, age, and income), then sampled approximately balanced subsets across these strata from the full dataset. This ensures (i) comparable class priors between training and test splits, and (ii) that no demographic group dominates the symbolic-utility discovery process. The predefined demographic grouping follows established practice in choice modeling, supports interpretability, and improves robustness by avoiding the complexity and data requirements of latent clustering methods [103–105]. Evaluation metrics. We report Accuracy, F1, AUC, and Cross-Entropy (CE). Models and baselines. Both stages of ATHENA, symbolic-utility discovery and individual semantic adaptation, run on the gpt-4o-mini-2024-07-18 and gemini-2.0-flash. To evaluate its performance, we contrasted ATHENA with three baseline groups. (i) LLM-based methods: a plain zero-shot method [106, 107], a zero-shot chain-of-thought method [106], a five-example few-shot method [108, 109], and TextGrad tuning [96]. (ii) Classical discrete-choice models: Multinomial Logit (MNL) [110], Conditional Logit (CLogit) [111], and Latent-Class MNL [112]. (iii) Standard machine-learning classifiers: logistic regression, random forest, XGBoost [113], a shallow two-layer MLP [114], TabNet for tabular data [115], and a fine-tuned BERT classifier [116]. 
Appendix B Baseline Setup. Table 9: Utility-based models and key settings (train : test = 0.8 : 0.2). Model: best hyper-parameters. Logistic Regression: C=10, penalty=l2, solver=saga. Random Forest: bootstrap=False, max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=600. XGBoost: colsample_bytree=0.8, learning_rate=0.05, max_depth=6, n_estimators=500, subsample=0.8.
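
The Dataset Splits and Experiment Setup rows describe two steps: drawing an approximately balanced subset across demographic strata (gender, age, income), then splitting it train : test = 0.8 : 0.2 so that class priors stay comparable. The following is a minimal sketch of one way to do this, not the authors' code. The function names, field names, synthetic records, and the choice to split within each class label are all assumptions made for illustration, and the sketch treats each record as one person (the paper's Swissmetro sample has two trip records per traveler).

```python
import random
from collections import defaultdict


def balanced_subset(records, strata_keys, n_total, seed=0):
    """Draw roughly equal numbers of records from every demographic stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in strata_keys)].append(rec)
    per_group = max(1, n_total // len(groups))  # equal quota per stratum
    subset = []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        subset.extend(members[:per_group])
    return subset


def stratified_split(records, label_key, train_frac=0.8, seed=0):
    """Split inside each class so train and test keep similar class priors."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label][:]
        rng.shuffle(members)
        cut = round(train_frac * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Synthetic stand-in data (hypothetical fields, not the Swissmetro schema):
# 8 strata x 30 people, with the choice label cycling over three modes.
combos = [
    (g, a, inc)
    for _ in range(30)
    for g in ("F", "M")
    for a in ("young", "old")
    for inc in ("low", "high")
]
records = [
    {"id": i, "gender": g, "age": a, "income": inc,
     "choice": ("Train", "Car", "Metro")[i % 3]}
    for i, (g, a, inc) in enumerate(combos)
]

subset = balanced_subset(records, ("gender", "age", "income"), n_total=80)
train_set, test_set = stratified_split(subset, "choice", train_frac=0.8)
print(len(subset), len(train_set), len(test_set))
```

Fixing the seed makes the subset and the split repeatable, which is the detail a reader would need in order to reproduce the reported 0.8 : 0.2 partition. The source does not state the seed or the exact sampling procedure.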