Adapting to Mixing Time in Stochastic Optimization with Markovian Data
Authors: Ron Dorfman, Kfir Yehuda Levy
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we validate the performance of our algorithm on a synthetic linear regression task." and "6. Experiments: We demonstrate the performance of our approach on a synthetic linear regression problem with Markovian data." |
| Researcher Affiliation | Academia | ¹Viterbi Faculty of Electrical and Computer Engineering, Technion, Haifa, Israel; ²A Viterbi Fellow. |
| Pseudocode | Yes | Algorithm 1: MAG (MLMC AdaGrad). (A sketch of the underlying MLMC gradient estimator appears below the table.) |
| Open Source Code | No | The paper does not provide any explicit statement about making the source code available or a link to a code repository. |
| Open Datasets | No | "We consider a symmetric 2-state Markov chain, for which the probability of transitioning between states is p ∈ (0, 1). For each state i ∈ {1, 2}, we randomly choose w_i^* ∈ R^d and draw X_i ∈ R^{n×d} according to N(0, I). Then, we set y_i = X_i w_i^* + ε, where ε is a random noise vector drawn from N(0, 10^-3 · I)." (A data-generation sketch appears below the table.) |
| Dataset Splits | No | The paper describes generating a synthetic dataset for experiments but does not provide explicit details about train/validation/test splits, as it appears to be a continuous simulation rather than a partitioned dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | "In practice, we observed that the multiplicative 2^J factor in the MLMC estimator resulted in high-magnitude gradients, which in turn significantly reduced the learning rate of AdaGrad and slowed down future progress. Therefore, instead of drawing J ~ Geom(1/2), we used a truncated geometric distribution such that P(J = j) ∝ 2^-j, j ∈ [K], for some fixed K. We used K = 5, which was not optimally tuned. In Figure 1 we present the performance of the methods for n = 250, d = 100 and p = 10^-4." (A sampler for the truncated geometric distribution appears below the table.) |
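For context on the Pseudocode row: MAG feeds AdaGrad with an MLMC gradient estimate built from runs of consecutive Markovian samples. Below is a minimal Python sketch of that estimator under interfaces we assume for illustration (`grad_fn(w, sample)` returning a per-sample gradient, and `chain` as an iterator over the Markov chain); it is not the authors' code, and it omits the paper's cap on 2^J relative to the horizon.

```python
import numpy as np

def mlmc_gradient(grad_fn, chain, w, rng):
    """Sketch of an MLMC gradient estimate: g^0 + 2^J (g^J - g^{J-1}),
    where g^j is the average of the first 2^j consecutive stochastic
    gradients along the Markov chain and J ~ Geom(1/2)."""
    j = int(rng.geometric(0.5))           # P(J = j) = 2^{-j}, j >= 1
    n = 2 ** j
    grads = np.stack([grad_fn(w, next(chain)) for _ in range(n)])
    g_j = grads.mean(axis=0)              # average of 2^j gradients
    g_jm1 = grads[: n // 2].mean(axis=0)  # average of first 2^{j-1}
    g_0 = grads[0]                        # single-gradient baseline
    return g_0 + (2 ** j) * (g_j - g_jm1)
```

The telescoping correction term makes the estimator behave, in expectation, like an average over a long run of the chain while keeping the expected number of gradient evaluations constant.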
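For context on the Open Datasets row: the data are purely synthetic. A minimal sketch of the quoted generating process, assuming NumPy and a seed of our own choosing (the paper specifies none):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 250, 100, 1e-4   # values reported in the experiment setup

# One regression problem per state of the 2-state chain.
w_star = [rng.standard_normal(d) for _ in range(2)]
X = [rng.standard_normal((n, d)) for _ in range(2)]
y = [X[i] @ w_star[i] + np.sqrt(1e-3) * rng.standard_normal(n)
     for i in range(2)]                  # noise ~ N(0, 1e-3 * I)

def run_chain(T, state=0):
    """Symmetric 2-state Markov chain: switch state w.p. p."""
    for _ in range(T):
        if rng.random() < p:
            state = 1 - state
        yield state
```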
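For context on the Experiment Setup row: the authors replace Geom(1/2) with a truncated geometric distribution, P(J = j) ∝ 2^-j on {1, ..., K} with K = 5. A minimal sketch of such a sampler (the function name is ours):

```python
import numpy as np

def sample_truncated_geometric(K, rng):
    """Draw J with P(J = j) proportional to 2^{-j} on {1, ..., K}."""
    support = np.arange(1, K + 1)
    probs = 2.0 ** -support
    return int(rng.choice(support, p=probs / probs.sum()))

j = sample_truncated_geometric(5, np.random.default_rng(0))
```

Truncating at K bounds the multiplicative 2^J factor by 2^K, which is the paper's stated remedy for the occasional high-magnitude gradients that otherwise shrink AdaGrad's learning rate.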