Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
Authors: Shuo-Chieh Huang, Ruey S. Tsay
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices. To validate the performance of TSRGA, we apply it to both synthetic and real-world data sets and show that TSRGA converges much faster than other existing methods. In the simulation experiments, TSRGA achieved the smallest estimation error using the least number of iterations. |
| Researcher Affiliation | Academia | Shuo-Chieh Huang, Ruey S. Tsay; Booth School of Business, University of Chicago, Chicago, IL 60637, USA |
| Pseudocode | Yes | Algorithm 1: Feature-distributed relaxed greedy algorithm (RGA); Algorithm 2: Feature-distributed second-stage RGA |
| Open Source Code | No | The paper mentions using third-party tools such as 'Open MPI', 'mpi4py', and the 'glmnet package in R', but provides neither a link to nor a release statement for the source code of the TSRGA method it describes. |
| Open Datasets | Yes | All series are obtained from Yahoo! Finance via the tidyquant package in R. The corpus utilized in this application is sourced from the EDGAR-CORPUS, originally prepared by Loukas et al. (2021). |
| Dataset Splits | Yes | As a benchmark, we also solve the Lasso problem with 5-fold cross validation using glmnet package in R. For TSRGA, we simply set Ln = 500 and tn = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, pn) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, Ln is set to 105, and we hold one third of the training data as validation set to select the tuning parameter tn for TSRGA over a grid of values. We reserved the last year of data [for the test set]. |
| Hardware Specification | Yes | The algorithm runs on the high-performance computing cluster of the university, which comprises multiple computing nodes equipped with Intel Xeon Gold 6248R processors. |
| Software Dependencies | No | The paper mentions 'Open MPI and the Python binding mpi4py (Dalcín et al., 2005; Dalcín and Fang, 2021)', 'glmnet package in R', and 'gensim package in Python'. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | The step size of the Hydra-type algorithms is set to the lowest value so that we observe convergence of the algorithms instead of divergence. For TSRGA, we simply set Ln = 500 and tn = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, pn) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, Ln is set to 105, and we hold one third of the training data as validation set to select the tuning parameter tn for TSRGA over a grid of values. tn is selected among t = (0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30) / log n. |
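
The tuning choices quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a short Python sketch. It is not the authors' code (the table notes that none was released). The function names, the choice to hold out the last third of the training rows rather than some other third, and the use of the training sample size as n are all assumptions made for illustration; only the grid numerators and the formulas 1/(10 log n) and t / log n come from the quoted text.

```python
import math

# Grid numerators copied verbatim from the Experiment Setup row.
GRID_NUMERATORS = (0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30)


def default_tn(n):
    """Default threshold quoted in the table: t_n = 1 / (10 log n)."""
    return 1.0 / (10.0 * math.log(n))


def tn_grid(n):
    """Candidate thresholds: each grid numerator divided by log n."""
    return [t / math.log(n) for t in GRID_NUMERATORS]


def holdout_split(n_train):
    """Hold out one third of the training rows as a validation set.

    Assumption: the LAST third is held out. The quoted text does not
    say which third is used, so this is only one possible reading.
    """
    n_val = n_train // 3
    fit_idx = range(0, n_train - n_val)
    val_idx = range(n_train - n_val, n_train)
    return fit_idx, val_idx


if __name__ == "__main__":
    n = 1200  # one of the sample sizes listed in the table, used illustratively
    fit_idx, val_idx = holdout_split(n)
    print(len(fit_idx), len(val_idx))
    print(default_tn(n))
    print(tn_grid(n))
```

Under this reading, a validation-based selection would fit TSRGA on `fit_idx` once per candidate in `tn_grid(n)` and keep the candidate with the smallest validation error. That selection loop is omitted because it depends on the TSRGA implementation, which is not available.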