Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Robust Methods for High-Dimensional Linear Learning
Authors: Ibrahim Merad, Stéphane Gaïffas
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our results through numerical experiments using synthetic data for regression and real data sets for classification (Section 6). Our experiments confirm our theoretical findings together with a comparison to other recent approaches proposed in the literature. |
| Researcher Affiliation | Academia | Ibrahim Merad EMAIL LPSM, UMR 8001, Université Paris Diderot, Paris, France; Stéphane Gaïffas EMAIL LPSM, UMR 8001, Université Paris Diderot, Paris, France and DMA, École normale supérieure |
| Pseudocode | Yes | Algorithm: Approximate Multistage Mirror Descent (AMMD); Algorithm: Approximate Multistage Dual Averaging (AMDA) |
| Open Source Code | Yes | All algorithms introduced in this paper as well as the main baselines from the literature we use for comparisons are implemented and easily accessible in a few lines of code through our Python library called linlearn, open-sourced under the BSD-3 License on GitHub and available at https://github.com/linlearn/linlearn |
| Open Datasets | Yes | We also carry out experiments on real high dimensional binary classification data sets. These are referred to as gina and bioresponse and were both downloaded from openml.com. |
| Dataset Splits | No | The paper mentions "data splitting" as a proof artifact for theoretical guarantees, explicitly stating: "Note that we do not implement it later in our experimental section." For empirical evaluation, it mentions using a "validation subset of the data" for a heuristic, but does not provide specific details (percentages, counts, or predefined splits) for the overall training, validation, and testing of the models. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud computing instance types) used for running the experiments. It describes the experimental setup in terms of data generation and algorithms, but not the computational environment. |
| Software Dependencies | No | The paper mentions a "Python library called linlearn" (their own) and "Scikit Learn" (a third-party library) but does not provide specific version numbers for either of them. The requirement is for specific version numbers for reproducibility. |
| Experiment Setup | Yes | We generate n covariates X_i ∈ ℝ^d following a non-isotropic distribution with covariance matrix Σ and labels Y_i = X_i^⊤ θ* + ξ_i for a fixed s-sparse θ* ∈ ℝ^d and simulated noise entries ξ_i. The covariance matrix Σ is diagonal with entries drawn uniformly at random in [1, 10]. We use the least-squares loss ℓ(z, y) = ½(z − y)² in this experiment and the problem parameters are n = 500, d = 5000, s = 40, and a sparsity upper bound of 50 is given to the algorithms instead of the real value. The noise variables ξ_i always follow a Pareto distribution with parameter α = 2.05. |
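The synthetic regression setup quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `make_synthetic_data`, the seed, the Gaussian design, and the noise centering are assumptions made here for a runnable example (NumPy's `pareto` sampler returns a Lomax/Pareto-II variate with mean 1/(α − 1) for α > 1, which we subtract to center the noise).

```python
import numpy as np

def make_synthetic_data(n=500, d=5000, s=40, alpha=2.05, seed=0):
    """Sketch of the paper's synthetic setup: non-isotropic sparse
    linear regression with heavy-tailed (Pareto) noise."""
    rng = np.random.default_rng(seed)
    # Diagonal covariance Sigma with entries uniform in [1, 10].
    sigma_diag = rng.uniform(1.0, 10.0, size=d)
    # Gaussian design is an assumption; the paper only states the covariance.
    X = rng.normal(size=(n, d)) * np.sqrt(sigma_diag)
    # s-sparse ground-truth parameter theta* with random support.
    theta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta[support] = rng.normal(size=s)
    # Heavy-tailed noise: numpy's pareto(alpha) is Lomax-distributed with
    # mean 1/(alpha - 1); subtract it so the noise is centered.
    noise = rng.pareto(alpha, size=n) - 1.0 / (alpha - 1.0)
    y = X @ theta + noise
    return X, y, theta

X, y, theta = make_synthetic_data()
```

With α = 2.05 the noise has finite mean and variance but barely so (moments of order ≥ 2.05 diverge), which is exactly the heavy-tailed regime the paper's robust estimators target.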