Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards

Authors: Yuwei Cheng, Zifeng Zhao, Haifeng Xu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical findings are validated by simulation experiments. Finally, we perform simulation studies which validate our theoretical findings.
Researcher Affiliation | Academia | Yuwei Cheng, Department of Statistics, University of Chicago, Chicago, IL 60637, EMAIL; Zifeng Zhao, Mendoza College of Business, University of Notre Dame, Notre Dame, IN 46556, EMAIL; Haifeng Xu, Department of Computer Science and Data Science, University of Chicago, Chicago, IL 60637, EMAIL
Pseudocode | Yes | Algorithm 1: Online Contextual Reinforcement Learning with Delayed Poisson Reward. Input: $d, T, H, b, B_x, B_\theta, B_d, B_A, \delta, \gamma, \Gamma, n_l$
Open Source Code | No | However, given the lack of publicly available online advertising and bidding datasets, we leave this empirical validation to future work as an important step toward bridging the gap between theory and practice.
Open Datasets | No | However, given the lack of publicly available online advertising and bidding datasets, we leave this empirical validation to future work as an important step toward bridging the gap between theory and practice.
Dataset Splits | Yes | We simulate a second-price auction with horizon $H = 3$ (a realistic setting since advertisers typically show ads only a few times per user), context dimension $d = 2$, and $T = 20{,}000$ sequential customers. The first 2400 rounds are used for exploration, and the remaining rounds for exploitation.
Hardware Specification | No | Environment Setup. We simulate a second-price auction with horizon $H = 3$ (a realistic setting since advertisers typically show ads only a few times per user), context dimension $d = 2$, and $T = 20{,}000$ sequential customers. The first 2400 rounds are used for exploration, and the remaining rounds for exploitation.
Software Dependencies | No | Estimators. We estimate four sets of parameters: $\theta_l$ via the online Newton method (Algorithm 3) with truncation threshold $\Gamma = 100{,}000$, bound $B_\theta = 10$, zero initialization, and $V_0 = I$. We estimate delay impact $d_l$ via the two-stage MLE (Eq. (3)) using $D_{t,l}$, $\hat{\theta}$, and $x_t$; $\beta$ by ridge regression (Eq. (5), $\lambda = 1.0$); and $\sigma$ via empirical variance (Eq. (6)).
Experiment Setup | Yes | Environment Setup. We simulate a second-price auction with horizon $H = 3$ (a realistic setting since advertisers typically show ads only a few times per user), context dimension $d = 2$, and $T = 20{,}000$ sequential customers. The first 2400 rounds are used for exploration, and the remaining rounds for exploitation. Context vectors $x_t \in \mathbb{R}^2$, as well as parameters $d_l$, $\beta_h$, $\sigma_h$, are sampled elementwise from $\lvert \mathcal{N}(0, 1) \rvert + 0.1$ to ensure positivity, while $\theta_l \sim 5 \lvert \mathcal{N}(0, 1) \rvert + 0.1$. Under this configuration, each ad impression yields an average instantaneous reward roughly five times its cost (i.e., the highest other bid). Estimators. We estimate four sets of parameters: $\theta_l$ via the online Newton method (Algorithm 3) with truncation threshold $\Gamma = 100{,}000$, bound $B_\theta = 10$, zero initialization, and $V_0 = I$. We estimate delay impact $d_l$ via the two-stage MLE (Eq. (3)) using $D_{t,l}$, $\hat{\theta}$, and $x_t$; $\beta$ by ridge regression (Eq. (5), $\lambda = 1.0$); and $\sigma$ via empirical variance (Eq. (6)).
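
The sampling scheme quoted in the Experiment Setup row can be sketched in a few lines of NumPy. This is only an illustration of the stated distributions and the exploration/exploitation split, not the authors' code (the page reports that no code was released). The function names, the dictionary keys, the seed, and the array shapes chosen for the parameters (for example, one row of theta per step of the horizon) are assumptions made for this sketch; the quoted text gives only the horizon, the context dimension, the number of customers, and the sampling distributions.

```python
import numpy as np

# Values quoted in the setup description.
H, D, T = 3, 2, 20_000   # horizon, context dimension, sequential customers
N_EXPLORE = 2_400        # first rounds reserved for exploration


def sample_positive(rng, shape, scale=1.0):
    """Draw scale * |N(0, 1)| + 0.1 elementwise, so every entry is >= 0.1."""
    return scale * np.abs(rng.standard_normal(shape)) + 0.1


def make_environment(seed=0):
    """Sample contexts and parameters; all parameter shapes are assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "contexts": sample_positive(rng, (T, D)),          # x_t in R^2
        "theta": sample_positive(rng, (H, D), scale=5.0),  # 5|N(0,1)| + 0.1
        "delay": sample_positive(rng, (H,)),               # d_l (assumed shape)
        "beta": sample_positive(rng, (H,)),                # beta_h (assumed shape)
        "sigma": sample_positive(rng, (H,)),               # sigma_h (assumed shape)
    }


def phase(t):
    """Return the phase of 0-indexed round t under the stated split."""
    return "exploration" if t < N_EXPLORE else "exploitation"
```

Under this split, 2,400 of the 20,000 rounds are exploration and the remaining 17,600 are exploitation. The auction itself, the delayed Poisson rewards, and the estimators are not modeled here.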