Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning
Authors: Tianchen Zhou, Fnu Hairi, Haibo Yang, Jia Liu, Tian Tong, Fan Yang, Michinari Momma, Yan Gao
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experiments conducted on a real-world dataset validate the effectiveness of our proposed method. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA 2Amazon.com, Seattle, WA, USA 3Department of Computer Science, University of Wisconsin-Whitewater, WI, USA 4Department of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, USA. |
| Pseudocode | Yes | Algorithm 1 MOAC critic with mini-batch TD-learning. Algorithm 2 The overall MOAC algorithmic framework. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper mentions using a "real-world dataset collected from the recommendation logs of the video-sharing mobile app Kuaishou" but does not provide any access information (link, DOI, or formal citation with authors/year for public access) for this dataset. While it cites the MOGymnasium environment, this is an environment library, not a dataset in the sense of pre-collected data files for public consumption. |
| Dataset Splits | No | The paper describes the use of synthetic and real-world data, but it does not specify explicit train/validation/test dataset splits using percentages, absolute counts, or references to predefined splits. |
| Hardware Specification | No | The paper mentions conducting experiments but does not provide any specific details about the hardware used, such as GPU models, CPU specifications, or cloud computing instance types. |
| Software Dependencies | No | The paper states, "For all the methods, we leverage ADAM to optimize the parameters." However, it does not provide specific version numbers for ADAM or any other software libraries or dependencies. |
| Experiment Setup | Yes | We use an open-source MOMDP environment MOGymnasium (Alegre et al., 2022) to conduct synthetic simulations on environment resource-gathering-v0, which has three reward signals. We test MOAC in the discounted reward setting with momentum coefficient ηt chosen from {t^(-1/2), t^(-1), t^(-2)}. For our method, we set the momentum coefficient of the gradient weight by ηt = 1/t (without pre-specifying values, the gradient weights are initialized by the solution to a QP problem on the average gradients of the first batch of samples), and set the same gradient weight initialization for all the other methods. We leverage ADAM to optimize the parameters. In both steps, we use constant step-sizes and mini-batch Markovian sampling. |
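The gradient-weight scheme in the setup cell combines two pieces: a min-norm QP over the objectives' average gradients (used for initialization) and a momentum average with coefficient ηt = 1/t. The sketch below is illustrative only, not the paper's implementation: it uses the closed-form two-objective solution of the MGDA-style min-norm QP and a generic momentum update; the function names and the two-objective restriction are assumptions.

```python
import numpy as np

def min_norm_weight_2obj(g1, g2):
    """Closed-form solution of the two-objective min-norm QP:
    minimize ||w*g1 + (1-w)*g2||^2 subject to 0 <= w <= 1.
    Setting the derivative to zero and clipping to [0, 1] gives w."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:          # identical gradients: any weight is optimal
        return 0.5
    return float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))

def momentum_weight_update(w_prev, w_tilde, t):
    """Momentum averaging of gradient weights with eta_t = 1/t.
    At t = 1, eta = 1, so the weight is fully set by the first batch's
    QP solution, matching the reported initialization."""
    eta = 1.0 / t
    return (1.0 - eta) * w_prev + eta * w_tilde

# Orthogonal unit gradients: the min-norm combination weights them equally.
g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w0 = min_norm_weight_2obj(g1, g2)      # 0.5
w1 = momentum_weight_update(w0, 1.0, t=2)
print(w0, w1)
```

With orthogonal unit gradients the min-norm weight is 0.5 (the combined gradient (0.5, 0.5) has the smallest norm on the segment between g1 and g2), and the t = 2 momentum step averages the previous weight with the new batch's solution using eta = 1/2.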