Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Spectral Regularization Allows Data-frugal Learning over Combinatorial Spaces

Authors: Amirali Aghazadeh, Nived Rajaraman, Tony Tu, Kannan Ramchandran

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Complementing our theory, we empirically demonstrate that running gradient descent on the regularized loss results in a better generalization performance compared to baseline algorithms in several data-scarce real-world problems. ... Empirical demonstrations on several real-world data sets complement our theoretical findings.
Researcher Affiliation | Academia | Amirali Aghazadeh EMAIL School of Electrical and Computer Engineering Georgia Institute of Technology; Nived Rajaraman EMAIL Department of Electrical Engineering & Computer Science University of California, Berkeley; Tony Tu EMAIL School of Computer Science Georgia Institute of Technology; Kannan Ramchandran EMAIL Department of Electrical Engineering & Computer Sciences University of California, Berkeley
Pseudocode | No | The paper describes methods and algorithms in natural language and mathematical notation but does not contain a dedicated pseudocode block or algorithm section with structured steps.
Open Source Code | No | The paper does not contain an unambiguous statement or a direct link to a source-code repository for the methodology described. It only provides a link to the OpenReview forum for the paper: https://openreview.net/forum?id=mySiFHCeAl&noteId=WhfpRCk8Wz
Open Datasets | Yes | Protein is a dataset which measures the fluorescence level of 213 protein sequences that link two variants of the Entacmaea quadricolor proteins different at exactly 13 amino acids (Poelwijk et al., 2019). T cell is a dataset which measures the DNA repair outcome of T cells (average length of deletions) on 1521 sites on human genome after applying double-strand breaks (DSBs) using CRISPR (Leenay et al., 2019). Cancer is a similar dataset on 287 sites on cancer genome (Leenay et al., 2019).
Dataset Splits | Yes | Following the low-n experimental setting in (Biswas et al., 2021), we use a subset of 30 sequences drawn uniformly at random for training and validation and use the rest for testing. We repeat each experiment 10 times with independent random splits of the data and report the RMSE in predicting the phenotype.
Hardware Specification | No | The paper mentions network architectures like 'depth-4 fully connected network' and 'depth-4 CNN' but does not provide specific hardware details such as GPU/CPU models or memory specifications used for experiments.
Software Dependencies | No | We use the default hyperparameters in scikit-learn for the baselines. While 'scikit-learn' is mentioned, a specific version number is not provided, and no other software dependencies with versions are listed.
Experiment Setup | Yes | We initialize DNNs using Xavier (equal seeds). We use a depth-4 fully connected network (learning rate = 1 × 10^-1) to train on the fluorescence protein (Poelwijk et al., 2019) dataset... Fig. 2 shows the results for Xavier-initialized, depth-2 FCNs... (learning rate = 5 × 10^-3)... Fig. 2 also shows the result for a depth-4 CNN... (learning rate = 10^-3). Weight initialization refers to initialization of the network by first running 100 epochs of SGD against the unregularized MSE.
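
The split protocol quoted under Dataset Splits (30 sequences drawn uniformly at random for training and validation, the rest held out for testing, repeated over 10 independent splits, RMSE reported) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names `low_n_split`, `evaluate`, and `mean_baseline` are hypothetical, the phenotype values are synthetic placeholders sized to match the 213-sequence Protein dataset, and the predictor is a trivial mean baseline standing in for any real model. How the 30 sequences are divided between training and validation is not stated in the summary, so the sketch treats them as one pool.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def low_n_split(n_samples, n_train, rng):
    """Draw n_train indices uniformly at random; the rest form the test set."""
    indices = list(range(n_samples))
    train_idx = rng.sample(indices, n_train)
    train_set = set(train_idx)
    test_idx = [i for i in indices if i not in train_set]
    return train_idx, test_idx


def evaluate(y, fit_predict, n_train=30, n_repeats=10, seed=0):
    """Repeat the low-n split n_repeats times and collect the test RMSE."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = low_n_split(len(y), n_train, rng)
        preds = fit_predict(train_idx, test_idx)
        scores.append(rmse([y[i] for i in test_idx], preds))
    return scores


# Synthetic placeholder phenotypes (assumption: NOT the real fluorescence data).
data_rng = random.Random(1)
y = [data_rng.gauss(0.0, 1.0) for _ in range(213)]


def mean_baseline(train_idx, test_idx):
    """Hypothetical stand-in model: predict the training mean everywhere."""
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return [mean] * len(test_idx)


scores = evaluate(y, mean_baseline)
mean_rmse = sum(scores) / len(scores)
print(f"mean test RMSE over {len(scores)} splits: {mean_rmse:.3f}")
```

Seeding a single `random.Random` instance keeps the 10 splits independent of each other while making the whole run repeatable; swapping `mean_baseline` for a real model only requires matching the `(train_idx, test_idx)` signature.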