Scaling laws for learning with real and surrogate data
Authors: Ayush Jain, Andrea Montanari, Eren Sasoglu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. |
| Researcher Affiliation | Collaboration | Ayush Jain (Granica Computing Inc., ayush.jain@granica.ai); Andrea Montanari (Granica Computing Inc. and Stanford University, andrea.montanari@granica.ai); Eren Sasoglu (Granica Computing Inc., eren.sasoglu@granica.ai). Granica Computing Inc.: granica.ai |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes mathematical formulations and experimental procedures in text. |
| Open Source Code | No | The paper states: 'All the datasets used are public datasets. The results can be reproduced using the details we provided.' However, it does not explicitly state that the authors' own source code for the methodology is released or provide a link to it. |
| Open Datasets | Yes | We carry out experiments with the following data sources. (1) Simulated data... (2) Real natural language processing (NLP) data... IMDB reviews... Rotten Tomatoes reviews and Goodreads book reviews. (3) Progression-free survival analysis using Lasso on the TCGA Pan Cancer dataset. (4) Real image classification data, with CIFAR-10 and CIFAR-100 datasets... All of these datasets are explicitly linked or cited in Section F 'Datasets information'. |
| Dataset Splits | Yes | We split the original dataset into train, test, and validation sets, while all examples in the surrogate datasets are allocated solely to the train split. |
| Hardware Specification | Yes | We ran all experiments on a single machine with 2 RTX 4090 GPUs and a 24-core Intel Xeon E5 CPU. |
| Software Dependencies | No | The paper mentions software like the 'scikit-learn implementation', 'NLTK tagger', 'paraphrase-MiniLM-L6-v2 sentence transformer', 'CoxPHFitter model', and 'Adam for optimization', but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Logistic regression: We use the scikit-learn implementation with the lbfgs solver, fitting the intercept, with maximum iterations set to 10k. For each run of each (n, m, α) combination, we set the ℓ2 penalty (parameter C in scikit-learn) to 2^i, i = -8, ..., 8 and 10^i, i = -6, -5, -4, -3, 3, 4, 5, 6, and only report the test result for the value that achieves the best validation error. Neural network: The network has one hidden layer with 32 ReLU neurons, and an output neuron using a sigmoid. For training, we use the binary cross-entropy loss, a constant learning rate of 0.05, and batch size 64. We train the network for 1,000 epochs. Similar to the procedure in logistic regression, we use ℓ2 regularization (weight decay) and use the validation set to choose the best regularization parameter from the set {0, 10^-5, 10^-4, 10^-3, 2·10^-3, 4·10^-3, 10^-2, 2·10^-2, 4·10^-2, 10^-1, 2·10^-1, 4·10^-1}. (Hedged code sketches of both configurations follow this table.) |
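
The logistic-regression sweep reported in the Experiment Setup row maps naturally onto scikit-learn. Below is a minimal sketch, assuming in-memory feature matrices and using error rate (1 − accuracy) as the validation criterion; the data handling and the helper name `fit_best_logreg` are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of the reported logistic-regression setup: lbfgs solver, fitted intercept,
# max_iter = 10k, and an L2-penalty grid C in {2^i, i = -8..8} and {10^i, i in {-6,-5,-4,-3,3,4,5,6}},
# reporting the test error for the C value that minimizes validation error.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_best_logreg(X_train, y_train, X_val, y_val, X_test, y_test):
    c_grid = [2.0 ** i for i in range(-8, 9)] + [10.0 ** i for i in (-6, -5, -4, -3, 3, 4, 5, 6)]
    best_val_err, best_model = np.inf, None
    for c in c_grid:
        model = LogisticRegression(C=c, solver="lbfgs", fit_intercept=True, max_iter=10_000)
        model.fit(X_train, y_train)
        val_err = 1.0 - model.score(X_val, y_val)   # validation error rate
        if val_err < best_val_err:
            best_val_err, best_model = val_err, model
    return 1.0 - best_model.score(X_test, y_test)   # test error of the best-on-validation model
```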
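The neural-network configuration can be sketched in PyTorch in the same hedged spirit: one hidden layer of 32 ReLU units, a sigmoid output neuron, binary cross-entropy loss, batch size 64, a constant learning rate of 0.05, and 1,000 training epochs. The optimizer choice (plain SGD here) and the float-tensor input format are assumptions; the paper's software list mentions Adam, but this row does not name the optimizer.

```python
# Hedged sketch of the reported one-hidden-layer network. X_train, y_train are assumed to be float tensors.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Weight-decay grid reported in the paper; the best value is chosen on the validation set.
WEIGHT_DECAY_GRID = [0, 1e-5, 1e-4, 1e-3, 2e-3, 4e-3, 1e-2, 2e-2, 4e-2, 1e-1, 2e-1, 4e-1]

def train_mlp(X_train, y_train, input_dim, weight_decay=0.0, epochs=1000):
    model = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=weight_decay)  # optimizer is an assumption
    loss_fn = nn.BCELoss()
    loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb).squeeze(-1), yb)   # binary cross-entropy on sigmoid outputs
            loss.backward()
            opt.step()
    return model
```

As with the logistic-regression sketch, one would call `train_mlp` once per value in `WEIGHT_DECAY_GRID` and keep the model with the lowest validation error before evaluating on the test set.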