Anytime Guarantees under Heavy-Tailed Data
Authors: Matthew J. Holland
AAAI 2022, pp. 6918-6925
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we complement the preceding theoretical analysis with an application of the proposed learning strategy to real-world benchmark datasets. The practical utility of various gradient truncation mechanisms has already been well-studied in the literature (Chen, Su, and Xu 2017; Prasad et al. 2018; Lecué, Lerasle, and Mathieu 2018; Holland and Ikeda 2019), and thus our chief point of interest here is if and when the feedback scheme utilized in Algorithm 1 can outperform the traditional feedback mechanism given by (2), under a convex, differentiable true objective. Put more succinctly, the key question is: is there a practical benefit to querying at points with guarantees? Experimental setup: Considering the context of key related work (Gorbunov, Danilova, and Gasnikov 2020; Nazin et al. 2019), we focus on averaged SGD as our baseline, and consider several real-world classification datasets of varying size, using standard multi-class logistic regression as our model. Results and discussion: Our results are summarized in Figure 1, which plots the average training and test losses. |
| Researcher Affiliation | Academia | Matthew J. Holland, Osaka University |
| Pseudocode | Yes | Algorithm 1: Anytime robust online-to-batch conversion. |
| Open Source Code | Yes | A public repository including all experimental code has been published: https://github.com/feedbackward/anytime |
| Open Datasets | Yes | For CIFAR-10, we observe that the robustified version performs worse than vanilla anytime averaged SGD; this looks to be due to the simple h̃ = h1 setting, and can be readily mitigated by updating h̃ after one pass over the data. |
| Dataset Splits | No | The paper specifies a training and test split ("the training set is of size ntr := 0.8n, and the test set is of size n − ntr") but does not explicitly mention a validation split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It only mentions the implementation environment: "Everything is implemented by hand in Python (ver. 3.8), making significant use of the numpy library (ver. 1.20)." |
| Software Dependencies | Yes | Everything is implemented by hand in Python (ver. 3.8), making significant use of the numpy library (ver. 1.20). |
| Experiment Setup | Yes | For all methods, the step size in update (17) is fixed at βt = 2/√ntr, for all steps t; this setting is appropriate for Anytime-* methods due to Corollary 7, and also for SGD-Ave based on standard results such as Nemirovski et al. (2009, Sec. 2.3). The (Gt) are obtained by direct computation of the logistic loss gradients, averaged over a mini-batch of size 8; this size was set arbitrarily for speed and stability, and no other mini-batch values were tested. Furthermore, for each method and each trial, the initial value h1 is randomly generated in a dimension-wise fashion from the uniform distribution on the interval [−0.05, 0.05]. All raw input features are normalized to the unit interval [0, 1] in a per-feature fashion. We do not do any regularization, for any method being tested. ... First, as a simple choice of anchors h̃ and g̃, we set h̃ = h1 and estimate g̃ using the empirical mean on the training data set; ... As for the thresholds (ct) used in the Process sub-routine, we set ct = √(ntr / log(δ⁻¹)) for all t, with a confidence level of δ = 0.05 fixed throughout. |
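To make the quoted setup concrete, the sketch below shows the pre-processing, initialization, and mini-batch gradient computation described in the Experiment Setup row: per-feature normalization of raw inputs to [0, 1], dimension-wise uniform initialization on [−0.05, 0.05], and multi-class logistic (softmax cross-entropy) loss gradients averaged over a mini-batch. This is a minimal numpy illustration under our own reading of the quoted text; all function names are hypothetical and are not taken from the author's released repository.

```python
import numpy as np

def normalize_per_feature(X):
    """Rescale each raw input feature to the unit interval [0, 1] (per-feature)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (X - lo) / span

def init_weights(rng, d, k):
    """Dimension-wise uniform initialization on [-0.05, 0.05]."""
    return rng.uniform(low=-0.05, high=0.05, size=(d, k))

def logistic_grad(W, X_batch, y_batch, k):
    """Multi-class logistic loss gradient, averaged over the mini-batch."""
    scores = X_batch @ W                              # shape (b, k)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)         # softmax probabilities
    onehot = np.eye(k)[y_batch]                       # integer labels -> one-hot
    return X_batch.T @ (probs - onehot) / len(y_batch)  # shape (d, k)
```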
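Building on those helpers, a minimal sketch of the SGD-Ave baseline under the quoted settings follows: a fixed step size βt = 2/√ntr, mini-batches of size 8, no regularization, and a uniform average of the iterates. The threshold ct = √(ntr / log(1/δ)) with δ = 0.05 is computed only to mirror the quoted formula; the robust Process sub-routine of Algorithm 1 is not reproduced here, and the single-pass step count is our own assumption.

```python
def sgd_average(X, y, k, rng, batch_size=8, delta=0.05):
    """Plain averaged-SGD baseline (SGD-Ave) under the settings quoted above."""
    n_tr, d = X.shape
    beta = 2.0 / np.sqrt(n_tr)                  # fixed step size for all steps t
    c_t = np.sqrt(n_tr / np.log(1.0 / delta))   # quoted threshold; unused by this plain baseline
    W = init_weights(rng, d, k)
    W_sum = np.zeros_like(W)
    num_steps = n_tr // batch_size              # assumption: roughly one pass over the training data
    for t in range(num_steps):
        idx = rng.choice(n_tr, size=batch_size, replace=False)
        G_t = logistic_grad(W, X[idx], y[idx], k)
        W = W - beta * G_t
        W_sum += W
    return W_sum / num_steps                    # uniform average of the iterates

# Hypothetical usage on an already-loaded dataset (X_raw, y, k classes):
# rng = np.random.default_rng(0)
# W_avg = sgd_average(normalize_per_feature(X_raw), y, k, rng)
```

The averaging of iterates here reflects the standard online-to-batch conversion used by the SGD-Ave baseline; the paper's Anytime-* methods replace this with the feedback scheme of Algorithm 1, whose details are not reproduced in this table.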