Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sparse Modal Regression with Mode-Invariant Skew Noise

Authors: Kazuki Koyama, Takayuki Kawashima, Hironori Fujisawa

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Numerical experiments on artificial and real-world data demonstrate that the proposed method performs significantly better and is more stable than other existing methods for various skew-noise data." "In Section 4, numerical experiments on artificial and real-world data are demonstrated."
Researcher Affiliation | Academia | Kazuki Koyama (EMAIL), The Graduate University for Advanced Studies (SOKENDAI); Takayuki Kawashima (EMAIL), Tokyo Institute of Technology; Hironori Fujisawa (EMAIL), The Institute of Statistical Mathematics / The Graduate University for Advanced Studies (SOKENDAI) / RIKEN Center for Advanced Intelligence Project
Pseudocode | Yes | Algorithm 1: Optimization of the proposed method
    Require: hyper-parameter λ and initialized β, σ, α
     1: while until convergence do
     2:   while until convergence (of β) do
     3:     for n = 1, ..., N do
     4:       calculate z_n and w_n as follows:
     5:         z_n ← …
     6:         w_n ← …
     7:     end for
     8:     β ← argmin_β [(1 + ρ_α²) / (Nσ²(1 − ρ_α²)²)] Σ_{n=1}^N (y_n − X_n⊤β)² + λ‖β‖₁
     9:   end while
    10:   σ ← argmin_{σ: σ > 0} ℓ(θ | D_N) with L-BFGS-B (or other valid algorithm)
    11:   α ← argmin_α ℓ(θ | D_N) with L-BFGS (or other valid algorithm)
    12: end while
Open Source Code | No | The paper does not contain an explicit statement about the release of source code, nor does it provide a link to a code repository.
Open Datasets | Yes | "We applied the proposed method to the following two medical datasets: PDGFR (Platelet Derived Growth Factor Receptor) consists of N = 79 samples and P = 320 features, where the outcome is the ability to inhibit PDGFR phosphorylation (Guha & Jurs, 2004). MTP (Melting Point) includes N = 274 samples and P = 1142 features, where the outcome is the melting point of drug-like compounds (Karthikeyan et al., 2005)." "We applied the proposed method to the Engineering Graduate Salary (EGS) prediction data (Aggarwal et al., 2016)."
Dataset Splits | Yes | "For each trial, the regularization coefficient λ was adjusted by 5-fold cross-validation based on the log-likelihood loss, in which the numbers of training and validation data for each trial were set to 400 and 100, respectively." "The regularization coefficient λ was tuned by 5-fold cross-validation with 80% training and 20% validation data." "We used 80% of the samples for training and the remaining 20% for testing. Then, we tuned each regularization coefficient λ with 5-fold cross-validation using 20% of the training samples (i.e., 16% of all samples) as validation data."
Hardware Specification | No | The paper does not contain any specific hardware details, such as CPU or GPU models or memory specifications, used for running the experiments.
Software Dependencies | No | "As shown in line 8 of Algorithm 1, we can rewrite the β update as another Lasso-type problem and then use well-known software, e.g., the sklearn.linear_model package of Python." "In this paper, we employ the L-BFGS algorithm (Liu & Nocedal, 1989), which is an iterative method for solving non-linear optimization problems. In particular, for the inequality constraint σ > 0, we can utilize the L-BFGS-B algorithm (Byrd et al., 1995; Zhu et al., 1997), which extends the L-BFGS algorithm to handle bounded constraints." The paper mentions software packages like 'sklearn.linear_model' and 'scipy.optimize' (for L-BFGS) but does not specify their version numbers.
Experiment Setup | Yes | "The sample size was set to N = 500. We conducted 50 experiments with different random seeds. For each trial, the regularization coefficient λ was adjusted by 5-fold cross-validation based on the log-likelihood loss, in which the numbers of training and validation data for each trial were set to 400 and 100, respectively." "The tuning power parameter in the Yeo-Johnson transformation was determined by maximum likelihood estimation." "We used 80% of the samples for training and the remaining 20% for testing. Then, we tuned each regularization coefficient λ with 5-fold cross-validation using 20% of the training samples (i.e., 16% of all samples) as validation data. These samples were generated randomly, and 30 trials were conducted with different random seeds."
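The alternating structure quoted in the Pseudocode row (a Lasso-type β step solved with sklearn.linear_model, then a bounded σ step solved with L-BFGS-B) can be sketched in a few lines. This is a minimal illustration only: the paper's skew-noise log-likelihood ℓ(θ | D_N), its z_n / w_n auxiliary updates, and the α step are not reproduced here, and `fit_alternating` is a hypothetical name. A plain Gaussian negative log-likelihood stands in so the loop runs end to end.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso

def fit_alternating(X, y, lam=0.1, n_outer=5):
    """Sketch of Algorithm 1's outer loop (Gaussian stand-in likelihood)."""
    N, P = X.shape
    beta, sigma = np.zeros(P), 1.0
    for _ in range(n_outer):
        # beta-step: a Lasso-type problem (cf. line 8 of Algorithm 1),
        # delegated to sklearn.linear_model as the paper suggests
        beta = Lasso(alpha=lam, fit_intercept=False,
                     max_iter=10_000).fit(X, y).coef_

        # sigma-step: bounded scalar minimization with L-BFGS-B,
        # enforcing the inequality constraint sigma > 0 via `bounds`
        def nll(s):
            r = y - X @ beta
            return N * np.log(s[0]) + 0.5 * np.sum(r ** 2) / s[0] ** 2

        sigma = minimize(nll, x0=[sigma], method="L-BFGS-B",
                         bounds=[(1e-6, None)]).x[0]
    return beta, sigma
```

The α step of Algorithm 1 would be handled analogously with unconstrained L-BFGS, since α carries no positivity constraint.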
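The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (repeated seeded trials, an 80%/20% train/test split, and 5-fold cross-validation over λ on the training portion) can be mirrored with scikit-learn. This sketch substitutes mean squared error for the paper's log-likelihood loss and a plain Lasso for the proposed estimator; `run_trials` and the synthetic data are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, train_test_split

def run_trials(n_trials=5, N=500, P=20, grid=(0.01, 0.05, 0.1, 0.5)):
    """Repeated seeded trials with an 80/20 split and 5-fold CV over lambda."""
    test_errors = []
    for seed in range(n_trials):
        # synthetic sparse-regression data, regenerated per seed
        rng = np.random.default_rng(seed)
        beta_true = np.zeros(P)
        beta_true[:3] = (2.0, -1.5, 1.0)
        X = rng.normal(size=(N, P))
        y = X @ beta_true + rng.normal(scale=0.5, size=N)

        # 80% train / 20% test (400 / 100 samples when N = 500)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)

        # 5-fold CV on the training set: each fold holds out 20% of the
        # training samples (16% of all samples) as validation data
        cv = KFold(n_splits=5, shuffle=True, random_state=seed)

        def cv_mse(lam):
            errs = []
            for tr, va in cv.split(X_tr):
                m = Lasso(alpha=lam, max_iter=10_000).fit(X_tr[tr], y_tr[tr])
                errs.append(np.mean((y_tr[va] - m.predict(X_tr[va])) ** 2))
            return np.mean(errs)

        best = min(grid, key=cv_mse)
        model = Lasso(alpha=best, max_iter=10_000).fit(X_tr, y_tr)
        test_errors.append(np.mean((y_te - model.predict(X_te)) ** 2))
    return np.mean(test_errors), np.std(test_errors)
```

Reporting the mean and standard deviation of the test error across seeds matches how results over the paper's 50 (artificial-data) and 30 (real-data) trials would typically be aggregated.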