Stochastic Approximation Approaches to Group Distributionally Robust Optimization
Authors: Lijun Zhang, Peng Zhao, Zhen-Hua Zhuang, Tianbao Yang, Zhi-Hua Zhou
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted experiments to evaluate our proposed algorithms. The empirical results are presented in Appendix C, and align closely with our theories. For experiments on the synthetic dataset, we will generate the random samples on the fly, according to the protocol above. For those on the Adult dataset, we will randomly select samples from each group. In other words, P_i is defined as the empirical distribution over the data in the i-th group. We refer to the method of Sagawa et al. [2020] and our Algorithm 1 by SMD(1) and SMD(m) to underscore that they are instances of SMD with 1 sample and m samples in each iteration, respectively. We denote our Algorithm 2 as Online(1) to emphasize that it is based on techniques from online learning and uses 1 sample per iteration. We set ℓ(·;·) to be the logistic loss and utilize different methods to train a linear model. In the balanced setting, we use the max risk, i.e., max_{i∈[m]} R_i(w), as the performance measure. To estimate the risk value, we draw a substantial number of samples and use the empirical average to approximate the expectation. We first report the max risk with respect to the number of iterations in Fig. 1. We observe that SMD(m) is faster than Online(1), which in turn outperforms SMD(1). This observation is consistent with our theories, since their convergence rates are O(√((log m)/T)), O(√(m(log m)/T)), and O(m√((log m)/T)), respectively. Next, we plot the max risk against the number of samples consumed by each algorithm in Fig. 2. As can be seen, the curves of SMD(m) and Online(1) are very close, indicating that they share the same sample complexity, i.e., O(m(log m)/ε²). By contrast, SMD(1) has a much higher sample complexity, i.e., O(m²(log m)/ε²). We present the risk on each individual distribution in Fig. 3 and Fig. 4. *(Illustrative sketches of the SMD update and the max-risk estimation follow this table.)* |
| Researcher Affiliation | Collaboration | Lijun Zhang1,2, Peng Zhao1, Zhen-Hua Zhuang1, Tianbao Yang3, Zhi-Hua Zhou1 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2Peng Cheng Laboratory, Shenzhen 518055, China 3Department of Computer Science and Engineering, Texas A&M University, College Station, USA {zhanglj, zhaop, zhuangzh, zhouzh}@lamda.nju.edu.cn, tianbao-yang@tamu.edu |
| Pseudocode | Yes | Algorithm 1: Stochastic Mirror Descent for GDRO; Algorithm 2: Non-oblivious Online Learning for GDRO; Algorithm 3: Stochastic Mirror Descent for Weighted GDRO; Algorithm 4: Stochastic Mirror-Prox Algorithm for Weighted GDRO |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We also use the Adult dataset [Becker and Kohavi, 1996], which includes attributes such as age, gender, race, and educational background of 48842 individuals. |
| Dataset Splits | No | The paper mentions selecting samples for 'estimating the risk of each group' and simulating an 'imbalanced setting', but it does not specify explicit train/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined splits) for model training. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | No | The paper states that they 'set ℓ( ; ) to be the logistic loss and utilize different methods to train a linear model,' and describes how synthetic data is generated. It also mentions the number of groups (m=20 or m=6) and total iterations (nm, n1, nm/2). However, it does not specify concrete hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings for the training process of the linear model in its experimental setup. |
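The Research Type and Pseudocode rows above describe GDRO in its stochastic saddle-point form, min_{w∈W} max_{q∈Δ_m} Σ_{i∈[m]} q_i R_i(w) with R_i(w) = E_{z∼P_i}[ℓ(w; z)], and compare stochastic mirror descent (SMD) variants that differ in how many samples they draw per iteration. Below is a minimal, illustrative sketch of an SMD(m)-style update with the logistic loss and a linear model: a Euclidean gradient step on the model w and an entropic (exponentiated-gradient) ascent step on the group weights q. The function names, the constant step sizes `eta_w` and `eta_q`, and the `sample_from_group(i)` oracle are assumptions for illustration; the paper's Algorithm 1 specifies its own step sizes, projections, and mirror maps, which are not reproduced in the table above.

```python
import numpy as np

def logistic_loss_and_grad(w, x, y):
    """Logistic loss ell(w; (x, y)) = log(1 + exp(-y <w, x>)) and its gradient in w."""
    margin = y * np.dot(x, w)
    loss = np.logaddexp(0.0, -margin)                   # numerically stable log(1 + e^{-margin})
    grad = -y * x * np.exp(-np.logaddexp(0.0, margin))  # -y * x * sigmoid(-margin)
    return loss, grad

def smd_gdro(sample_from_group, m, dim, T, eta_w=0.1, eta_q=0.1):
    """Sketch of an SMD(m)-style method for min_w max_{q in simplex} sum_i q_i R_i(w).

    sample_from_group(i) is assumed to return a fresh sample (x, y) drawn from P_i;
    eta_w and eta_q are illustrative constant step sizes, not the paper's choices.
    """
    w = np.zeros(dim)            # Euclidean geometry on the model
    q = np.full(m, 1.0 / m)      # entropic (KL) geometry on the simplex
    w_avg = np.zeros(dim)
    for t in range(T):
        losses = np.empty(m)
        grad_w = np.zeros(dim)
        for i in range(m):       # m samples per iteration, one from each group
            x, y = sample_from_group(i)
            losses[i], g = logistic_loss_and_grad(w, x, y)
            grad_w += q[i] * g
        w -= eta_w * grad_w                  # descent step on w
        q *= np.exp(eta_q * losses)          # exponentiated-gradient ascent step on q
        q /= q.sum()
        w_avg += (w - w_avg) / (t + 1)       # return the averaged iterate
    return w_avg, q
```

Sampling a single group index per iteration instead of all m groups gives an SMD(1)-style method in the spirit of Sagawa et al. [2020], which the table above associates with the slower O(m√((log m)/T)) rate.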
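The max risk, max_{i∈[m]} R_i(w), used as the performance measure in the balanced setting, is approximated in the experiments by empirical averages over a large number of samples. A minimal sketch under the same assumptions (logistic loss, a hypothetical `sample_from_group(i)` oracle, and an arbitrary per-group sample count `n_eval`):

```python
import numpy as np

def estimate_max_risk(w, sample_from_group, m, n_eval=10_000):
    """Monte-Carlo estimate of max_i R_i(w) for a linear model under the logistic loss.

    n_eval fresh samples are drawn per group; both n_eval and sample_from_group
    are illustrative assumptions, not values taken from the paper.
    """
    risks = np.empty(m)
    for i in range(m):
        total = 0.0
        for _ in range(n_eval):
            x, y = sample_from_group(i)
            total += np.logaddexp(0.0, -y * np.dot(x, w))  # logistic loss
        risks[i] = total / n_eval
    return float(risks.max())
```

Plotting such an estimate against the iteration count (Fig. 1) or against the cumulative number of samples consumed (Fig. 2) corresponds to the comparisons summarized in the Research Type row.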