Achieving Fairness in Multi-Agent MDP Using Reinforcement Learning
Authors: Peizhong Ju, Arnob Ghosh, Ness Shroff
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation experiments also demonstrate the efficacy of our approach. We have conducted experiments on randomly generated MDP environments to verify our online and offline algorithms. Please see Appendix G for details. In Fig. 1, we plot the curves of V^F_1(s_1) for the optimal policy (dashed red curves) and for the policy calculated by Algorithm 1 (blue curves) under different fair objectives. In Figs. 3 and 4, each point of the offline curve is calculated by applying the offline algorithm to the data generated by Algorithm 1 at the K-th epoch. |
| Researcher Affiliation | Academia | Peizhong Ju, Department of ECE, The Ohio State University, Columbus, OH 43210, USA (ju.171@osu.edu); Arnob Ghosh, ECE Department, New Jersey Institute of Technology, Newark, NJ 07102, USA (arnob.ghosh@njit.edu); Ness B. Shroff, Department of ECE and CSE, The Ohio State University, Columbus, OH 43210, USA (shroff.11@osu.edu) |
| Pseudocode | Yes | Algorithm 1 Online Fair MARL; Algorithm 2 Policy Gradient for Max-Min Fairness |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the described methodology. |
| Open Datasets | No | We use a synthetic MDP. Each term of the transition probability p is i.i.d. uniformly generated between [0, 1], and then we normalize p to make sure that ∑_{s'∈S} p(s, a, s') = 1. Every term of the true immediate reward r is i.i.d. uniformly generated between [0.15, 0.95]. Each noisy observation of an immediate reward is drawn from a uniform distribution centered at its true value within a range of ±0.05 (thus all noisy observations are in [0.1, 1]). |
| Dataset Splits | No | The paper does not explicitly specify training/test/validation dataset splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We use a synthetic MDP. Each term of the transition probability p is i.i.d. uniformly generated between [0, 1], and then we normalize p to make sure that ∑_{s'∈S} p(s, a, s') = 1. Every term of the true immediate reward r is i.i.d. uniformly generated between [0.15, 0.95]. Each noisy observation of an immediate reward is drawn from a uniform distribution centered at its true value within a range of ±0.05 (thus all noisy observations are in [0.1, 1]). Figs. 1 and 2 use S = A = N = 2 and H = 3. Fig. 3 uses A = 3, S = 3, N = 3, H = 4. Fig. 4 uses A = 2, S = 2, N = 3, H = 10. Figs. 5 and 6 use A = 2, S = 2, H = 3. |
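
To make the quoted experiment setup concrete, below is a minimal sketch (in Python/NumPy, not the authors' code) of how such a synthetic MDP could be generated. The function names, array shapes, and the per-agent, per-step reward indexing are assumptions made for illustration; only the sampling ranges, the normalization of p, and the ±0.05 observation noise come from the quoted text.

```python
# Minimal sketch of the synthetic-MDP generation described in the table above.
# NOT the authors' code: the shapes and the reward indexing r[n, h, s, a]
# (agent n, step h, state s, action a) are assumptions for illustration.
import numpy as np

def make_synthetic_mdp(num_states=2, num_actions=2, num_agents=2,
                       horizon=3, seed=None):
    rng = np.random.default_rng(seed)

    # Transition kernel: i.i.d. Uniform[0, 1] entries, normalized over s'
    # so that sum_{s' in S} p(s, a, s') = 1 for every (s, a).
    p = rng.uniform(0.0, 1.0, size=(num_states, num_actions, num_states))
    p /= p.sum(axis=-1, keepdims=True)

    # True immediate rewards: i.i.d. Uniform[0.15, 0.95], one reward
    # function per agent (the per-step indexing is an assumption).
    r_true = rng.uniform(0.15, 0.95,
                         size=(num_agents, horizon, num_states, num_actions))
    return p, r_true, rng

def noisy_reward(true_value, rng, half_width=0.05):
    # Noisy observation: uniform, centered at the true value, within +/- 0.05,
    # so with true rewards in [0.15, 0.95] all observations lie in [0.1, 1].
    return rng.uniform(true_value - half_width, true_value + half_width)

# Example: the S = A = N = 2, H = 3 configuration used in Figs. 1 and 2.
p, r_true, rng = make_synthetic_mdp(2, 2, 2, 3, seed=0)
print(p.sum(axis=-1))                         # each entry is 1 (up to rounding)
print(noisy_reward(r_true[0, 0, 0, 0], rng))  # one noisy reward sample
```

Only the sampling ranges, the normalization step, and the noise half-width are grounded in the paper's description; how rewards are indexed over agents and steps, and how random seeds are handled, would need to be checked against the authors' Appendix G.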