Accelerated Sparse Linear Regression via Random Projection

Authors: Weizhong Zhang, Lijun Zhang, Rong Jin, Deng Cai, Xiaofei He

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiment Study Having analysed the performance of the proposed method from the theoretical perspective, we now turn to an empirical study of how the proposed method behaves. Following (Xiao and Zhang 2013), we evaluate the performance of our method mainly on two aspects: (i) the speed at which the learned vector β_t converges to the optimal solution β*; (ii) the sparsity of the learned vector β_T. The two baseline algorithms below are used in our study. ADG ((Nesterov 2013), Algorithm 4.9): Nesterov's accelerated dual gradient method. PGH (Xiao and Zhang 2013): a state-of-the-art algorithm with a global geometric convergence rate. SLviaRP: the proposed method in our study.

Experiments on the Synthesized Datasets. Dataset: We generate the data matrix X, the response vector y and the optimal solution β* in the following way. 1) For the data matrix X, we add Gaussian noise e_X to a low-rank matrix X0, i.e. X = X0 + e_X, where e_X ∈ R^{d×N} and [e_X]_ij ~ N(0, σ²_{e_X}). To generate X0, we first construct a random matrix U ∈ R^{d×N}, where each U_ij is sampled from the uniform distribution U[−1, 1]. We then compute a QR factorization of U, i.e. U = QR with Q ∈ R^{d×N} and R ∈ R^{N×N}, and extract the first k columns of Q to form a new matrix Q̂. Finally, we obtain the low-rank matrix X0 by projecting U onto Q̂, i.e. X0 = Q̂Q̂^T U. 2) For the sparse vector β* ∈ R^d, we select s of its d coordinates at random and set them to 1; all remaining components are set to 0. 3) The response vector is sampled from the prediction model y = X^T β* + e_y, where e_y ∈ R^N is a random noise vector drawn from N(0, σ²_{e_y} I_{N×N}).

Parameter setting: For the training data, we set d = 10000, N = 5000, k = 500, s = 25 and vary the noise levels σ_{e_X}, σ_{e_y} over {0.01, 0.02, ..., 0.05}. In ADG, Ψ(β) = 0.001||β||_1, γ_μ = γ_d = 1.5, μ = 0 and L_0 = 0.001. In PGH, we set λ_tgt = 0.001, ϵ = 10^{-6}, L_min = 10^{-6}, η = 0.7, δ = 0.2 and γ_dec = γ_inc = 2. Finally, for our method SLviaRP, we let γ = λ_0 = 0.3, η = 0.94 and λ_min = 0.002.

Goal: To recover the optimal sparse vector β* from the training data X and y both accurately and efficiently, while keeping the obtained solutions as sparse as possible.

Evaluation metrics: To evaluate the learned β_T, we measure the error ||β_t − β*||_2 and the sparsity over the iterations and over the running time (in seconds). The sparsity is computed as density = (1/d) Σ_{i=1}^{d} I([β_T]_i ≠ 0). We also measure the recovery of the support set of β_T by SSR(β_T) = 2|S(β_T) ∩ S(β*)| / (|S(β_T)| + |S(β*)|). We run each experiment 100 times and report the averaged results.

Experimental results: Figures 1 and 2 show the performance of the different algorithms at noise level σ_{e_X} = σ_{e_y} = 0.01. From the left panels of Figures 1 and 2, we observe that our method converges to the optimal solution β* as fast as the recent algorithm PGH, so their convergence rates are comparable. The right panels show that our method improves run-time efficiency significantly, owing to its lower computational complexity per iteration. In addition, ADG does not handle such an under-determined problem well, since its error ||β_t − β*||_2 decreases slowly over the iterations; this may result from the fact that its solution path is not sparse (Table 1) over the iterations (Xiao and Zhang 2013).
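To make the synthesized-data recipe and the two evaluation metrics above concrete, here is a minimal NumPy sketch under the stated settings (d = 10000, N = 5000, k = 500, s = 25, noise level 0.01). The function names (make_synthetic, density, support_set_recovery) are ours for illustration and do not come from the paper.

import numpy as np

def make_synthetic(d=10000, N=5000, k=500, s=25, sigma_X=0.01, sigma_y=0.01, seed=None):
    """Illustrative sketch of the synthetic-data recipe (not the authors' code)."""
    rng = np.random.default_rng(seed)
    # 1) Low-rank X0: project a uniform random matrix U onto the span of the
    #    first k columns of its Q factor, then add Gaussian noise.
    U = rng.uniform(-1.0, 1.0, size=(d, N))
    Q, _ = np.linalg.qr(U)                   # U = Q R, Q is d x N
    Q_hat = Q[:, :k]                         # first k columns of Q
    X0 = Q_hat @ (Q_hat.T @ U)               # rank-k projection of U
    X = X0 + sigma_X * rng.standard_normal((d, N))
    # 2) s-sparse ground-truth vector: s random coordinates set to 1.
    beta_star = np.zeros(d)
    beta_star[rng.choice(d, size=s, replace=False)] = 1.0
    # 3) Linear responses with Gaussian noise.
    y = X.T @ beta_star + sigma_y * rng.standard_normal(N)
    return X, y, beta_star

def density(beta, tol=0.0):
    """Fraction of (numerically) nonzero coordinates of beta."""
    return float(np.mean(np.abs(beta) > tol))

def support_set_recovery(beta, beta_star, tol=0.0):
    """Dice overlap between the supports of beta and beta_star (the SSR metric)."""
    S = set(np.flatnonzero(np.abs(beta) > tol))
    S_star = set(np.flatnonzero(np.abs(beta_star) > tol))
    return 2 * len(S & S_star) / (len(S) + len(S_star))

At the paper's sizes the QR factorization of a 10000 × 5000 matrix dominates the cost of this generator, so smaller d and N are convenient for a quick sanity check.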
Experiments on Real-world Dataset. Experiment design: To further demonstrate the efficiency of our method, we conduct a regression experiment on the well-known MNIST dataset, which consists of gray-scale images of scanned handwritten digits. It has roughly 60000 training samples and 10000 testing samples, and each sample has dimension 28 × 28. We randomly sample 10,000 examples from the training set to form a data matrix X ∈ R^{10,000×784}, and obtain the response vector y ∈ R^{784} by sampling an image from the testing set. Our goal is to find a sparse vector β such that X^T β approximates y accurately.

Parameter setting: In ADG, Ψ(β) = 0.0005||β||_1, γ_μ = γ_d = 1.2, μ = 0 and L_0 = 0.001. In PGH, we set λ_tgt = 0.005, ϵ = 0.001, L_min = 0.005, η = 0.7, δ = 0.7 and γ_dec = γ_inc = 2. For our method SLviaRP, we let k = 100, γ = 10, λ_0 = 0.2, η = 0.97 and λ_min = 0.005. Finally, we run 100 trials and report the averaged performance of each method after 1000 iterations.

Evaluation metrics: Since the optimal solution β* is unknown, we use the residual, the sparsity and the running time to evaluate the performance of our method; the numerical results are summarized in Table 2.
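A corresponding sketch of the MNIST setup, assuming the image arrays have already been loaded elsewhere (the loader is omitted and the helper names below are ours, not the paper's): each of the 10,000 sampled training images becomes one row of X, and a single flattened test image serves as y, so the task is to find a sparse β with X^T β ≈ y.

import numpy as np

def build_mnist_regression(train_images, test_images, n_atoms=10000, seed=None):
    """Illustrative setup: train_images / test_images are arrays of shape (n, 28, 28)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_images), size=n_atoms, replace=False)
    X = train_images[idx].reshape(n_atoms, -1).astype(float)   # 10,000 x 784
    y = test_images[rng.integers(len(test_images))].reshape(-1).astype(float)  # length 784
    return X, y

def residual_norm(X, beta, y):
    """Residual ||X^T beta - y||_2, the accuracy metric used when beta* is unknown."""
    return float(np.linalg.norm(X.T @ beta - y))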
Researcher Affiliation Collaboration State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, China; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; Alibaba Group, Seattle, USA. {zhangweizhongzju, dengcai, xiaofeihe}@gmail.com, zhanglj@lamda.nju.edu.cn, jinrong.jr@alibaba-inc.com
Pseudocode Yes Algorithm 1 Accelerated Sparse Linear Regression via Random Projection (SLRviaRP)
1: Input: the data matrix X, y, λ_0, λ_min, γ, k, η
2: // Compute the approximation X̄:
3: Sample a d × k random matrix Z with Z_ij ~ N(0, 1)
4: Compute the QR decomposition of X^T Z, i.e., X^T Z = QR
5: Approximate X by X̄ = (XQ)Q^T = W^T Q^T
6: // Recover the sparse vector β:
7: Initialize β_0 = 0
8: for t = 0 to T − 1 do
9: Update: β_{t+1} = argmin_{β ∈ R^d} −(β − β_t)^T X̄(y − X̄^T β_t) + λ_t ||β||_1 + (γ/2) ||β − β_t||_2^2, where λ_t = max(λ_min, λ_0 η^t)
10: end for
11: Return: β_T
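For reference, here is a minimal NumPy reading of Algorithm 1. It assumes the update in line 9 is solved by the standard closed-form prox step (a gradient step of length 1/γ on the quadratic term followed by soft-thresholding at λ_t/γ); the function names are ours and this is an illustrative sketch, not the authors' released code.

import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def slr_via_rp(X, y, lam0, lam_min, gamma, k, eta, T=1000, seed=None):
    """Illustrative sketch of Algorithm 1 (SLRviaRP) for X of shape (d, N), y of length N.

    The approximation X_bar = (X Q) Q^T is never formed explicitly; we keep
    W = (X Q)^T (k x d) and Q (N x k) so that every iteration costs O(k(d + N)).
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    # Lines 3-5: random projection followed by a QR decomposition of X^T Z.
    Z = rng.standard_normal((d, k))
    Q, _ = np.linalg.qr(X.T @ Z)             # X^T Z = Q R, Q is N x k
    W = (X @ Q).T                            # W = Q^T X^T, so X_bar = W^T Q^T
    # Lines 7-10: iterative soft-thresholding with a decreasing regularizer.
    beta = np.zeros(d)
    for t in range(T):
        lam_t = max(lam_min, lam0 * eta**t)
        r = y - Q @ (W @ beta)               # y - X_bar^T beta_t
        grad = -(W.T @ (Q.T @ r))            # gradient of 0.5 * ||y - X_bar^T beta||^2
        # Closed-form minimizer of the line-9 objective (our reading):
        beta = soft_threshold(beta - grad / gamma, lam_t / gamma)
    return beta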
Open Source Code No No explicit statement about releasing source code or a link to a code repository was found.
Open Datasets Yes Experiments on Real-world Dataset. Experiment design: To further demonstrate the efficiency of our method, we conduct a regression experiment on the well-known MNIST dataset, which consists of gray-scale images of scanned handwritten digits. It has roughly 60000 training samples and 10000 testing samples, and each sample has dimension 28 × 28. We randomly sample 10,000 examples from the training set to form a data matrix X ∈ R^{10,000×784}.
Dataset Splits No The paper mentions 'training samples' and 'testing samples' for MNIST, but does not specify a separate 'validation' set or explicit percentages for training, validation, and testing splits used in their experiments. It only indicates that MNIST has existing train/test splits.
Hardware Specification No No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were mentioned.
Software Dependencies No No specific software dependencies with version numbers were provided.
Experiment Setup Yes Parameter setting: For the training data, we set d = 10000, N = 5000, k = 500, s = 25 and vary the noise levels σ_{e_X}, σ_{e_y} over {0.01, 0.02, ..., 0.05}. In ADG, Ψ(β) = 0.001||β||_1, γ_μ = γ_d = 1.5, μ = 0 and L_0 = 0.001. In PGH, we set λ_tgt = 0.001, ϵ = 10^{-6}, L_min = 10^{-6}, η = 0.7, δ = 0.2 and γ_dec = γ_inc = 2. Finally, for our method SLviaRP, we let γ = λ_0 = 0.3, η = 0.94 and λ_min = 0.002. ... Parameter setting: In ADG, Ψ(β) = 0.0005||β||_1, γ_μ = γ_d = 1.2, μ = 0 and L_0 = 0.001. In PGH, we set λ_tgt = 0.005, ϵ = 0.001, L_min = 0.005, η = 0.7, δ = 0.7 and γ_dec = γ_inc = 2. For our method SLviaRP, we let k = 100, γ = 10, λ_0 = 0.2, η = 0.97 and λ_min = 0.005. Finally, we run 100 trials and report the averaged performance of each method after 1000 iterations.
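Putting the quoted settings together, a hypothetical end-to-end run on the synthetic data could look like the snippet below, reusing the illustrative helpers sketched earlier in this report. The projection dimension k = 500 and the iteration budget T = 1000 for the synthetic case are our assumptions: the paper states k = 500 only for the data generation and 1000 iterations only for the MNIST experiment.

import numpy as np

# Assumes make_synthetic, slr_via_rp, density and support_set_recovery from the
# sketches above are available in the same module; all numeric settings are quoted
# from the paper except k and T, which are assumed as noted in the text.
X, y, beta_star = make_synthetic(d=10000, N=5000, k=500, s=25,
                                 sigma_X=0.01, sigma_y=0.01, seed=0)

beta_T = slr_via_rp(X, y, lam0=0.3, lam_min=0.002, gamma=0.3,
                    k=500, eta=0.94, T=1000, seed=0)

print("error ||beta_T - beta*||_2 :", np.linalg.norm(beta_T - beta_star))
print("density                    :", density(beta_T))
print("SSR                        :", support_set_recovery(beta_T, beta_star))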