Non-stationary Projection-Free Online Learning with Dynamic and Adaptive Regret Guarantees
Authors: Yibo Wang, Wenhao Yang, Wei Jiang, Shiyin Lu, Bing Wang, Haihong Tang, Yuanyu Wan, Lijun Zhang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments In this section, we present experimental results that verify our theoretical findings in dynamic regret. Empirical studies on adaptive regret can be found in the full version (Wang et al. 2023b). Setup. To evaluate our methods (i.e., BOGDIP, POLD and POLA) in dynamic regret minimization, we study the problem of online matrix completion, where the goal is to produce a matrix X from the trace norm ball in an online fashion to approximate the target matrix M ∈ R^{m×n}. Specifically, in each round t, the learner receives a sampled entry (i, j) with the value M_{ij} from the entry set B of M. Then, the learner chooses X from the trace norm ball K = {X ∈ R^{m×n} | ‖X‖_* ≤ δ}, where δ is a parameter, and suffers the online loss f_t(X) = Σ_{(i,j)∈B} |X_{ij} − M_{ij}|. We conduct the experiments with δ = 10^4 on the public dataset MovieLens 100K, which contains 100,000 ratings from 943 users on 1,682 movies. Following Wan, Xue, and Zhang (2021), we slightly modify the dataset to simulate non-stationary environments. Concretely, we generate an extended dataset {(i_k, j_k, M_{i_k j_k})}_{k=1}^{300000} by merging three copies of MovieLens 100K. For entries corresponding to k = 100001, …, 200000, we negate the original values M_{i_k j_k} to obtain −M_{i_k j_k}. For simplicity, we divide the extended dataset into T = 3000 partitions. In this way, the target matrix M drifts every 1000 rounds. Contenders. We choose the projection-free algorithm Multi-OCG (Wan, Xue, and Zhang 2021) and the projection-based algorithm Ader (Zhang et al. 2018) as the contenders. All parameters of each method are set according to the theoretical suggestions. For instance, the learning rate of the i-th expert is set as η_i = c(2^{i−1})^{−1/2} in Multi-OCG, η_i = c·2^{i−1}·T^{−1/2} in Ader, and η_i = c·2^{i−1}·T^{−3/4} in POLD, where c is a hyper-parameter selected from {10^{−1}, 10^0, …, 10^6}. Results.
We report the average instantaneous loss, the cumulative loss and the runtime (in seconds) against the number of rounds for each method in Figure 2. As evident from the results, projection-free methods are significantly more efficient than the projection-based approach (i.e., Ader), albeit with a slight compromise on cumulative loss. This observation is reasonable in the sense that (i) the cost of linear optimization over the trace norm ball is O(nnz(X)), whereas the projection operation suffers a much higher O(mn^2) cost; (ii) our methods ensure an O(T^{3/4}(1+P_T)^{1/4}) bound against the O(√(T(1+P_T))) bound of Ader. Moreover, owing to the inherent advantage in minimizing the general-case dynamic regret, our methods yield a lower cumulative loss compared to the projection-free contender Multi-OCG. |
| Researcher Affiliation | Collaboration | 1 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; 2 School of Artificial Intelligence, Nanjing University, Nanjing 210023, China; 3 Alibaba Group, Hangzhou 310052, China; 4 School of Software Technology, Zhejiang University, Ningbo 315048, China |
| Pseudocode | Yes | Algorithm 1: Blocked Online Gradient Descent with Infeasible Projections (BOGDIP) Input: number of rounds T, domain set K, step size η, infeasible projection oracle O_IP. Initialization: choose an arbitrary point x_1 ∈ K and set y_1 = x_1, m = 1, block size K = η^{−2/3} and error tolerance ε = η^{2/3}. 1: for t = 1 to T do 2: Submit x_t = x_m, observe f_t(x_t) and obtain ∇f_t(x_t) 3: if t mod K = 0 then 4: Update y_{m+1} according to (10) 5: Set x_{m+1}, y_{m+1} according to (11), and m = t/K + 1 6: end if 7: end for |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the methodology described in this paper. |
| Open Datasets | Yes | We conduct the experiments with δ = 10^4 on the public dataset MovieLens 100K, which contains 100,000 ratings from 943 users on 1,682 movies. |
| Dataset Splits | No | The paper mentions generating an extended dataset and dividing it into partitions but does not specify training, validation, or test splits in percentages or absolute counts. It states: "For simplicity, we divide the extended datasets into T = 3000 partitions." |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers). |
| Experiment Setup | Yes | All parameters of each method are set according to the theoretical suggestions. For instance, the learning rate of the i-th expert is set as η_i = c(2^{i−1})^{−1/2} in Multi-OCG, η_i = c·2^{i−1}·T^{−1/2} in Ader, and η_i = c·2^{i−1}·T^{−3/4} in POLD, where c is a hyper-parameter selected from {10^{−1}, 10^0, …, 10^6}. |
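The online matrix-completion protocol and the blocked-update structure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's BOGDIP: the infeasible-projection oracle is replaced by a crude nuclear-norm rescaling, the subgradient is the entrywise sign of the residual on sampled entries, and all dimensions, the drift schedule, and the helper names (`block_loss_and_grad`, `rescale_to_trace_ball`) are illustrative assumptions.

```python
import numpy as np

def block_loss_and_grad(X, batch, M):
    """Loss f_t(X) = sum over sampled entries |X_ij - M_ij|, plus a
    subgradient: the sign of the residual at each sampled entry."""
    G = np.zeros_like(X)
    loss = 0.0
    for i, j in batch:
        d = X[i, j] - M[i, j]
        loss += abs(d)
        G[i, j] = np.sign(d)
    return loss, G

def rescale_to_trace_ball(X, delta):
    """Crude feasibility step standing in for the infeasible-projection
    oracle: shrink X whenever its nuclear norm exceeds delta."""
    nuc = np.linalg.norm(X, ord="nuc")
    return X if nuc <= delta else X * (delta / nuc)

rng = np.random.default_rng(0)
m, n, T = 20, 30, 120
delta, eta, K = 5.0, 0.1, 4           # K plays the role of the block size
M = rng.standard_normal((m, n))       # target matrix (stands in for ratings)
X = np.zeros((m, n))                  # decision played for the current block
G_acc = np.zeros((m, n))              # gradients accumulated within a block
losses = []
for t in range(1, T + 1):
    if t == T // 2 + 1:
        M = -M                        # simulate the mid-stream drift
    batch = [(int(rng.integers(m)), int(rng.integers(n))) for _ in range(10)]
    loss, G = block_loss_and_grad(X, batch, M)
    losses.append(loss)
    G_acc += G
    if t % K == 0:                    # update only at block boundaries
        X = rescale_to_trace_ball(X - eta * G_acc, delta)
        G_acc = np.zeros((m, n))
```

The blocking mirrors lines 3-6 of Algorithm 1: the iterate is frozen for K rounds and updated once per block, so the expensive step (here the nuclear-norm computation; in the paper, the infeasible-projection/linear-optimization step) is invoked only T/K times rather than T times.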