Regret Minimization via Saddle Point Optimization
Authors: Johannes Kirschner, Alireza Bakhtiari, Kushagra Chandak, Volodymyr Tkachuk, Csaba Szepesvári
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | E Experiments All experiments below were run on a semi-bandit problem with a 'revealing action', as alluded to in the paragraph below Lemma 4. Specifically, we assume a semi-bandit model where $\mathcal{M} = \mathbb{R}^d$ and the features are $\phi_\pi \in \mathbb{R}^d$. For an instance $f \in \mathcal{M}$, the reward function is $r_f(\pi) = \langle \phi_\pi, f \rangle$ for all $\pi \in \Pi$. There is one revealing (sub-optimal) action $\hat\pi \neq \pi^*_f$. The observation for any action $\pi \neq \hat\pi$ is $M_f(\pi) = \mathcal{N}(\langle \phi_\pi, f \rangle, 1)$ (29). Define $M_{\hat\pi} = [\phi_{\pi_1}, \dots, \phi_{\pi_{|\Pi|}}]^\top$. Then the observation for action $\hat\pi$ is $M_f(\hat\pi) = \mathcal{N}(M_{\hat\pi} f, \mathbf{1}_d)$ (30). Thus, the information for any action $\pi \neq \hat\pi$ is $I_f(g, \pi) = \frac{1}{2\sigma^2} \langle \phi_\pi, g - f \rangle^2$ (31), while the information for action $\hat\pi$ is $I_f(g, \hat\pi) = \frac{1}{2\sigma^2} \| M_{\hat\pi}(g - f) \|^2 = \frac{1}{2\sigma^2} \sum_{\pi} \langle \phi_\pi, g - f \rangle^2$ (32). For this setting $\mathrm{Est}_n \leq O(d \log(n))$ (see Appendix A.1). E.1 Experimental Setup Our main objective is to compare our algorithm ANYTIME-E2D to the fixed-horizon E2D algorithm by Foster et al. [2021]. ANYTIME-E2D and E2D were implemented using the procedure described in Section 3.4. Both ANYTIME-E2D and E2D need to solve the inner convex problem in Lemma 5. To do so, we run Frank-Wolfe [Frank and Wolfe, 1956, Dunn and Harshbarger, 1978, Jaggi, 2013] for 100 steps, warm-starting the optimization at the solution from the previous round, $\mu_{t-1}$. For ANYTIME-E2D we further perform a grid search over $\lambda \in [0, \max_{g \in \mathcal{M}} \epsilon^{-2} \mathrm{dec}_{ac,f}(g)]$ (with a discretization of 50 points) to optimize over $\lambda$ within each iteration of Frank-Wolfe. For both the E2D and ANYTIME-E2D algorithms we used the version with the gaps replaced with the estimated gaps $\hat{\Delta}_f$, since we noticed that both algorithms performed better with $\hat{\Delta}_f$. For E2D, the scale hyperparameter was set to $\lambda = \sqrt{n / (4 \log(n))}$ as mentioned in Foster et al. [2021, Section 6.1.1], while for ANYTIME-E2D we set the hyperparameter $\epsilon_t^2 = d/t$. Further, we compare to standard bandit algorithms: Upper Confidence Bound (UCB) and Thompson Sampling (TS) [Lattimore and Szepesvári, 2020a].
E.2 Experiment 1 In this experiment, we aim to demonstrate the advantage of having an anytime algorithm. Specifically, we tune $\lambda$ in the E2D algorithm for different horizons $n = 200, 500, 1000, 2000$, but run it for a fixed horizon of $n = 2000$. As such, we expect our algorithm ANYTIME-E2D to perform better than E2D when $\lambda$ was tuned for the incorrect horizons (i.e., $n = 200, 500, 1000$). The feature dimension is $d = 3$. The number of decisions is $|\Pi| = 10$. We generated the features $\phi_\pi$ for each $\pi \in \Pi$ and the parameter $f \in \mathbb{R}^d$ randomly at the beginning and then kept them fixed throughout the experimentation. 100 independent runs were performed for each algorithm. The results of the experiment are shown in the left plot of Fig. 1. As expected, our algorithm ANYTIME-E2D performs better than E2D (for $n = 200, 500, 1000$). This indicates that the E2D algorithm is sensitive to the setting of $\lambda$, which is problematic when the horizon is not known beforehand, whereas our ANYTIME-E2D algorithm performs well even when the horizon is not known. E.3 Experiment 2 In this experiment, we investigate the case where $n < d^4$. As pointed out below Lemma 3, we expect an improvement in this regime, as the regret bound of our algorithm is $R_n \lesssim \min\{d\sqrt{n},\, d^{1/3} n^{2/3}\}$, while the default, fixed-horizon E2D algorithm cannot achieve these bounds simultaneously: one has to pick either $d\sqrt{n}$ or $d^{1/3} n^{2/3}$ beforehand when setting the scale hyperparameter $\lambda$. It is standard for the choice of $\lambda$ to be made according to the $d\sqrt{n}$ regret bound for E2D [Foster et al., 2021] (which is not optimal when $n \ll d^4$), especially if the horizon is not known beforehand. Thus, we set the horizon to $n = 1000$ and the dimension of the feature space to $d = 30$, which gives $n = 1000 \ll 810000 = d^4$. The rest of the setup and parameters are the same as in the previous experiment, except for the features $\phi_\pi$ and $f$, which are again chosen randomly in the beginning and then kept fixed throughout the experiment.
The results of the experiment are shown in the right plot of Fig. 1. As expected, our algorithm ANYTIME-E2D performs better than E2D, UCB, and TS. This indicates that ANYTIME-E2D is likely setting $\lambda$ appropriately to achieve the preferred $d^{1/3} n^{2/3}$ regret rate for small horizons. The poor performance of the other algorithms is expected, since E2D is optimized for the $d\sqrt{n}$ regret rate, which is worse for small horizons, while the UCB and TS algorithms are not known to achieve regret better than $d\sqrt{n}$. |
| Researcher Affiliation | Academia | Johannes Kirschner Department of Computer Science University of Alberta jkirschn@ualberta.ca Seyed Alireza Bakhtiari Department of Computer Science University of Alberta sbakhtia@ualberta.ca Kushagra Chandak Department of Computer Science University of Alberta kchandak@ualberta.ca Volodymyr Tkachuk Department of Computer Science University of Alberta vtkachuk@ualberta.ca Csaba Szepesvári Department of Computer Science University of Alberta szepesva@ualberta.ca |
| Pseudocode | Yes | Algorithm 1: ANYTIME-E2D and Algorithm 2: Exponential Weights Algorithm (EWA) for Density Estimation |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or providing links to a code repository for the described methodology. |
| Open Datasets | No | The paper describes generating synthetic data for its experiments, stating: 'We generated the features $\phi_\pi$ for each $\pi \in \Pi$ and parameter $f \in \mathbb{R}^d$ randomly at the beginning and then kept them fixed throughout the experimentation.' It does not provide access information for a publicly available dataset. |
| Dataset Splits | No | The paper mentions running experiments for a 'fixed horizon of n = 2000' or 'n = 1000' but does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud computing instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Frank-Wolfe' and comparing against 'Upper Confidence Bound (UCB) and Thompson Sampling (TS)' algorithms, but it does not specify any software names with version numbers (e.g., Python, PyTorch, or specific solvers with their versions). |
| Experiment Setup | Yes | E.1 Experimental Setup Our main objective is to compare our algorithm ANYTIME-E2D to the fixed-horizon E2D algorithm by Foster et al. [2021]. ANYTIME-E2D and E2D were implemented using the procedure described in Section 3.4. Both ANYTIME-E2D and E2D need to solve the inner convex problem in Lemma 5. To do so, we run Frank-Wolfe [Frank and Wolfe, 1956, Dunn and Harshbarger, 1978, Jaggi, 2013] for 100 steps, warm-starting the optimization at the solution from the previous round, $\mu_{t-1}$. For ANYTIME-E2D we further perform a grid search over $\lambda \in [0, \max_{g \in \mathcal{M}} \epsilon^{-2} \mathrm{dec}_{ac,f}(g)]$ (with a discretization of 50 points) to optimize over $\lambda$ within each iteration of Frank-Wolfe. For both the E2D and ANYTIME-E2D algorithms we used the version with the gaps replaced with the estimated gaps $\hat{\Delta}_f$, since we noticed that both algorithms performed better with $\hat{\Delta}_f$. For E2D, the scale hyperparameter was set to $\lambda = \sqrt{n / (4 \log(n))}$ as mentioned in Foster et al. [2021, Section 6.1.1], while for ANYTIME-E2D we set the hyperparameter $\epsilon_t^2 = d/t$. Further, we compare to standard bandit algorithms: Upper Confidence Bound (UCB) and Thompson Sampling (TS) [Lattimore and Szepesvári, 2020a]. The feature dimension is $d = 3$. The number of decisions is $|\Pi| = 10$. We generated the features $\phi_\pi$ for each $\pi \in \Pi$ and the parameter $f \in \mathbb{R}^d$ randomly at the beginning and then kept them fixed throughout the experimentation. 100 independent runs were performed for each algorithm. |
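The revealing-action semi-bandit instance quoted above (equations 29–32) can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' code: function and variable names are our own, the noise scale is taken as $\sigma = 1$ (matching the unit-variance Gaussians in (29)–(30)), and features are drawn randomly and then fixed, as in E.1.

```python
import numpy as np

def make_instance(d=3, n_actions=10, seed=0):
    """Random features phi_pi in R^d and true parameter f, fixed once (as in E.1)."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((n_actions, d))  # row pi holds phi_pi
    f = rng.standard_normal(d)
    return Phi, f

def reward(Phi, f, pi):
    """r_f(pi) = <phi_pi, f>."""
    return Phi[pi] @ f

def information(Phi, f, g, pi, revealing, sigma=1.0):
    """Information I_f(g, pi) between instances f and g.

    Normal action pi != pi_hat: (1 / (2 sigma^2)) * <phi_pi, g - f>^2   (eq. 31)
    Revealing action pi_hat:    (1 / (2 sigma^2)) * ||M (g - f)||^2,    (eq. 32)
    where M stacks all feature vectors, so the revealing action's
    observation is informative about every direction of g - f.
    """
    diff = g - f
    if pi == revealing:
        return float(np.sum((Phi @ diff) ** 2)) / (2 * sigma**2)
    return float((Phi[pi] @ diff) ** 2) / (2 * sigma**2)
```

Note the defining property of the instance: the revealing action's information is the sum of every ordinary action's information, so it can never be less informative than any single action, yet it is sub-optimal in reward.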