Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Thompson Sampling for Multi-Objective Linear Contextual Bandit

Authors: Somangchan Park, Heesang Ann, Min-hwan Oh

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance. Numerical experiments demonstrate the effectiveness of our proposed approach, showing improved performance in regret minimization, and objective-wise total reward maximization.
Researcher Affiliation | Academia | Somangchan Park, Seoul National University, EMAIL; Heesang Ann, Seoul National University, EMAIL; Min-hwan Oh, Seoul National University, EMAIL
Pseudocode | Yes | Algorithm 1: Multi-Objective Linear TS (MOL-TS)
Open Source Code | Yes | All codes of algorithms and experiments are provided in a ZIP file. Experimental results are provided in Section 6 and Appendix F.
Open Datasets | No | Each experimental setup contains 10 different instances with fixed number of arms K, objectives L, and feature dimension d. We demonstrate the case where K = 50, d = 5, L = 4. The parameter vector for each objective θ(ℓ) has a norm of 1. Each round, d-dimensional context vectors are revealed for every arm, bounded by 1 in Euclidean norm. Upon playing an arm, the agent receives a reward vector with an additional noise term, where the noise values are sampled from a zero mean Gaussian distribution with σ = 1.
Dataset Splits | No | The paper describes generating synthetic data for experiments and running for T=10000 rounds. It does not provide explicit training/test/validation dataset splits, which is common in contextual bandit settings where learning is sequential.
Hardware Specification | Yes | All experiments are conducted with INTEL(R) XEON(R) GOLD 6526Y CPU and 4 TB memory.
Software Dependencies | Yes | The software environment includes Python 3.12.7, Scipy 1.14.1, and Numpy 1.26.4.
Experiment Setup | Yes | We measure the Pareto regret and effective Pareto regret over T = 10000 rounds. Each experimental setup contains 10 different instances with fixed number of arms K, objectives L, and feature dimension d. We demonstrate the case where K = 50, d = 5, L = 4. The parameter vector for each objective θ(ℓ) has a norm of 1. Each round, d-dimensional context vectors are revealed for every arm, bounded by 1 in Euclidean norm. Upon playing an arm, the agent receives a reward vector with an additional noise term, where the noise values are sampled from a zero mean Gaussian distribution with σ = 1. MOL-ϵ-Greedy is basic MOMAB algorithm with ϵ = 0.05.