Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Inference on High-dimensional Single-index Models with Streaming Data
Authors: Dongxiao Han, Jinhan Xie, Jin Liu, Liuquan Sun, Jian Huang, Bei Jiang, Linglong Kong
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the performance of the proposed method, extensive simulation studies have been conducted. We provide applications to Nasdaq stock prices and financial distress data sets. |
| Researcher Affiliation | Academia | Dongxiao Han (School of Statistics and Data Science, KLMDASR, LEBPS, and LPMC, Nankai University, Tianjin 300071, China); Jinhan Xie (Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming 650091, China; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada); Jin Liu (School of Statistics and Data Science, KLMDASR, LEBPS, and LPMC, Nankai University, Tianjin 300071, China); Liuquan Sun (Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China); Jian Huang (Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China); Bei Jiang (Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada); Linglong Kong (Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada) |
| Pseudocode | Yes | Algorithm 1: Online estimation for the SIMs. Input: streaming data sets D1, …, Ds, … and tuning parameters λ1, …, λs, …, γ1, …, γs, …. 1. Calculate the offline lasso penalized estimators β̂1^(1), β̂2^(1) via (2) and (3) based on D1; 2. update n1 H1^(1) and n2 H2^(1); 3. for s = 2, 3, …: (i) read the current data set Ds; (ii) calculate the online lasso penalized estimators β̂1^(s) and β̂2^(s) via (5) and (6); (iii) update and store the summary statistics {β̂1^(s), β̂2^(s), Σ_{j=1}^{s} nj H1^(j), Σ_{j=1}^{s} nj H2^(j)}; (iv) calculate β̂ave^(s) = {β̂1^(s) + β̂2^(s)}/2; (v) release data set Ds from memory; end for. Output: β̂ave^(s) for s = 1, 2, … |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | In this section, we illustrate our method with the financial distress data set, which is available from https://www.kaggle.com/datasets/shebrahimi/financial-distress. |
| Dataset Splits | Yes | The data are split into m = 10 batches. We take the first two-year data set as the first data batch (n1 = 164) to guarantee a sufficiently large sample size at the initial stage and each subsequent one-year data set as a later data batch (nj = 82, j = 2, …, m − 1). In addition, the sample size of the final batch is nm = 72. Hence, the streaming data consists of m = 10 data batches with a total sample size Nm = 892. ... we split the data into m = 10 batches randomly, take the n1 = 108 observations as the first batch, and set each of the remaining 9 batches to contain nj = 100 observations. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | The tuning parameters λs and γs, s = 1, …, m, are chosen by the modified BIC (Wang et al., 2007). For example, we obtain λs by minimizing BIC(λs) = log[(β̂(λs) − β̂2^(s−1))ᵀ {Σ_{j=1}^{s−1} nj H1^(j)/(2Ns)} (β̂(λs) − β̂2^(s−1)) + Σ_{i=1}^{ns} l(Yi^(s), Xi^(s)ᵀ β̂(λs))/Ns] + CNs ‖β̂(λs)‖₀ log(Ns/2)/Ns, where β̂(λs) is obtained from (5), CNs = c log log(p), c is a constant, and ‖·‖₀ denotes the number of nonzero elements in a vector. Furthermore, we choose the robustification parameter τ in the Huber loss such that 80% of the prediction errors are in [−τ, τ]. ... hs = argmin_{h ∈ Sh} [tr{2 H1^(s) Ω̂1^(s−1)(h)/ns} − log det{Ω̂1^(s−1)(h)}]. |
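The core trick of the quoted Algorithm 1 — updating and storing summary statistics so that each data batch can be released from memory — can be sketched for the much simpler unpenalized squared-loss case. This is a minimal illustration of the streaming-summary idea, not the paper's lasso-penalized single-index estimator; `stream_ols` and `solve` are hypothetical names:

```python
# Streaming least squares: after each batch we keep only the accumulated
# X'X and X'y, yet the estimate after batch s equals the estimator fit on
# all data seen so far -- the reason Algorithm 1 can discard raw batches.

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination (a: list of lists)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def stream_ols(batches):
    """Process (X, y) batches; store only accumulated X'X and X'y."""
    p = len(batches[0][0][0])
    xtx = [[0.0] * p for _ in range(p)]   # running sum of X'X
    xty = [0.0] * p                       # running sum of X'y
    estimates = []
    for X, y in batches:                  # (i) read the current batch D_s
        for xi, yi in zip(X, y):          # (iii) update summary statistics
            for a_ in range(p):
                xty[a_] += xi[a_] * yi
                for b_ in range(p):
                    xtx[a_][b_] += xi[a_] * xi[b_]
        estimates.append(solve(xtx, xty))  # online estimate after D_s
        # (v) the raw batch (X, y) can now be released from memory
    return estimates
```

With noiseless data generated from y = x1 + 2·x2, the estimate after the final batch recovers (1, 2) exactly, matching what a single fit on the pooled data would give.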
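The modified-BIC tuning described in the Experiment Setup row can be illustrated with a generic criterion of the Wang et al. (2007) form — log(fit) plus a CN·log(N)/N penalty on the number of nonzero coefficients — scored over a grid of candidate estimates. This sketch uses a plain squared-error fit term rather than the paper's streaming criterion, and all function names are hypothetical:

```python
import math

def l0_norm(beta, tol=1e-12):
    """||beta||_0: the number of (numerically) nonzero coefficients."""
    return sum(1 for b in beta if abs(b) > tol)

def modified_bic(beta, X, y, c=1.0):
    """log(mean squared error) + C_N * log(N)/N * ||beta||_0,
    with C_N = c * log(log(p)) as in the quoted criterion."""
    n, p = len(y), len(beta)
    mse = sum((yi - sum(b * xij for b, xij in zip(beta, xi))) ** 2
              for xi, yi in zip(X, y)) / n
    c_n = c * math.log(math.log(p))
    return math.log(mse) + c_n * math.log(n) / n * l0_norm(beta)

def select_lambda(candidates, X, y):
    """candidates: {lambda: beta_hat(lambda)}; return the BIC minimizer."""
    return min(candidates,
               key=lambda lam: modified_bic(candidates[lam], X, y))
```

Given two candidates — a sparse vector close to the truth and a dense, worse-fitting one — the criterion prefers the sparse candidate, which is exactly the behavior the ‖·‖₀ penalty is designed to produce.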
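The quoted rule for the Huber robustification parameter — choose τ so that 80% of the prediction errors fall in [−τ, τ] — amounts to taking an upper quantile of the absolute prediction errors. A minimal sketch (the quantile convention and the names `choose_tau` / `huber_loss` are my own, not from the paper):

```python
import math

def choose_tau(errors, coverage=0.8):
    """Smallest tau such that at least `coverage` of the errors
    lie in [-tau, tau], i.e. an order statistic of |errors|."""
    abs_err = sorted(abs(e) for e in errors)
    k = math.ceil(coverage * len(abs_err))  # rank of the covering error
    return abs_err[k - 1]

def huber_loss(r, tau):
    """Huber loss: quadratic for |r| <= tau, linear beyond tau."""
    return 0.5 * r * r if abs(r) <= tau else tau * (abs(r) - 0.5 * tau)
```

In practice τ would be recomputed from the prediction errors of the current fit; by construction the chosen τ covers at least the requested fraction of errors.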