Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning

Authors: Tianchen Zhou, Fnu Hairi, Haibo Yang, Jia Liu, Tian Tong, Fan Yang, Michinari Momma, Yan Gao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, experiments conducted on a real-world dataset validate the effectiveness of our proposed method.
Researcher Affiliation | Collaboration | (1) Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA; (2) Amazon.com, Seattle, WA, USA; (3) Department of Computer Science, University of Wisconsin-Whitewater, WI, USA; (4) Department of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, USA.
Pseudocode | Yes | Algorithm 1: MOAC critic with mini-batch TD-learning. Algorithm 2: The overall MOAC algorithmic framework. (A generic mini-batch TD sketch is given after this table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository for the described methodology.
Open Datasets | No | The paper mentions using a "real-world dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou" but does not provide any access information (link, DOI, or formal citation with authors/year for public access) for this dataset. While it cites the MOGymnasium environment, this is an environment library, not a dataset in the sense of pre-collected data files for public consumption.
Dataset Splits | No | The paper describes the use of synthetic and real-world data, but it does not specify explicit train/validation/test dataset splits using percentages, absolute counts, or references to predefined splits.
Hardware Specification | No | The paper mentions conducting experiments but does not provide any specific details about the hardware used, such as GPU models, CPU specifications, or cloud computing instance types.
Software Dependencies | No | The paper states, "For all the methods, we leverage ADAM to optimize the parameters." However, it does not provide specific version numbers for ADAM or any other software libraries or dependencies.
Experiment Setup | Yes | We use the open-source MOMDP environment MOGymnasium (Alegre et al., 2022) to conduct synthetic simulations on resource-gathering-v0, which has three reward signals. We test MOAC in the discounted-reward setting with the momentum coefficient η_t chosen from {t^{-1/2}, t^{-1}, t^{-2}}. For our method, we set the momentum coefficient of the gradient weight to η_t = 1/t (when no values are pre-specified, the gradient weights are initialized by the solution to a QP problem over the average gradients of the first batch of samples), and use the same gradient-weight initialization for all the other methods. We leverage ADAM to optimize the parameters. In both steps, we use constant step-sizes and mini-batch Markovian sampling. (An illustrative sketch of this setup follows the table.)
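
The Pseudocode row above refers to a critic trained with mini-batch TD-learning over vector rewards. The snippet below is a minimal, generic sketch of such an update, assuming linear value approximation with one critic weight vector per objective; it is not the paper's Algorithm 1, and all names (`minibatch_td_critic_update`, `phi`, the step size `alpha`) are illustrative.

```python
import numpy as np

def minibatch_td_critic_update(w, batch, phi, gamma=0.99, alpha=0.05):
    """Generic mini-batch TD(0) update for a vector-reward critic.

    w     : (num_objectives, feature_dim) array, one linear critic per objective.
    batch : list of (state, vector_reward, next_state) transitions.
    phi   : callable mapping a state to a feature vector of length feature_dim.

    Illustrative sketch only; not the paper's Algorithm 1.
    """
    grad = np.zeros_like(w)
    for s, r, s_next in batch:
        f, f_next = phi(s), phi(s_next)
        # One TD error per objective; r is the vector of per-objective rewards.
        td_error = r + gamma * (w @ f_next) - (w @ f)   # shape: (num_objectives,)
        # Semi-gradient TD(0): accumulate td_error_i * phi(s) for each objective i.
        grad += np.outer(td_error, f)
    # Average over the mini-batch and take one constant-step-size update.
    return w + alpha * grad / len(batch)
```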
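The Experiment Setup row states that the gradient weights are initialized by the solution to a QP over the average gradients of the first batch, with momentum coefficient η_t = 1/t. The sketch below assumes an MGDA-style min-norm QP over the probability simplex and a convex-combination momentum update λ_t = (1 - η_t) λ_{t-1} + η_t λ̂_t; both the QP form and the update form are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_weights(avg_grads):
    """Min-norm QP over the simplex: minimize ||sum_i lambda_i g_i||^2.

    avg_grads : (num_objectives, dim) array of per-objective average gradients.
    The MGDA-style formulation here is an assumption; the paper's QP may differ.
    """
    m = avg_grads.shape[0]
    G = avg_grads @ avg_grads.T                          # Gram matrix of the gradients

    def objective(lam):
        return lam @ G @ lam

    constraints = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}]
    res = minimize(objective, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m,
                   constraints=constraints, method="SLSQP")
    return res.x

def momentum_weight_update(lam_prev, avg_grads, t):
    """Momentum-averaged gradient weights with eta_t = 1/t (assumed update form)."""
    eta_t = 1.0 / t
    lam_hat = min_norm_weights(avg_grads)                # weights from the current batch
    return (1.0 - eta_t) * lam_prev + eta_t * lam_hat
```

With η_1 = 1, the first call reduces to the QP solution on the first batch's average gradients, which matches the quoted initialization of the gradient weights.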