Improving Adaptivity via Over-Parameterization in Sequence Models
Authors: Yicheng Li, Qian Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide some numerical experiments to validate the theoretical results. For more detailed numerical experiments, please refer to Section C. |
| Researcher Affiliation | Academia | Yicheng Li, Department of Statistics and Data Science, Tsinghua University, Beijing, China (liyc22@mails.tsinghua.edu.cn); Qian Lin, Department of Statistics and Data Science, Tsinghua University, Beijing, China (qianlin@tsinghua.edu.cn). The corresponding author, Qian Lin, is also affiliated with the Beijing Academy of Artificial Intelligence, Beijing, China. |
| Pseudocode | No | We approximate the gradient flow equations (22) and (30) by discrete-time gradient descent and truncate the sequence model to the first N terms for some very large N. |
| Open Source Code | Yes | The codes are provided in the supplementary material. |
| Open Datasets | No | We consider the settings as in Corollary 3.3, where θ is given by (4) for some p > 0 and q ≥ 1. We set ϵ² = n⁻¹, where n can be regarded as the sample size, and consider the asymptotic performance of the generalization error as n grows. [...] We consider the two real-world datasets: California Housing and Concrete Compressive Strength. |
| Dataset Splits | No | No explicit mention of training/validation/test dataset splits is found. The paper focuses on the generalization error as a function of the training process and the sample size. |
| Hardware Specification | Yes | The experiments can be done on a 64-core CPU laptop with 32 GB of memory in one day. |
| Software Dependencies | No | The paper mentions 'discrete-time gradient descent' and implies numerical computation, but does not name the software libraries or programming languages used, nor their versions. |
| Experiment Setup | Yes | We approximate the gradient flow equations (22) and (30) by discrete-time gradient descent with a sufficiently small step size. Moreover, we truncate the sequence model to the first N terms for some very large N. We consider the settings as in Corollary 3.3, where θ is given by (4) for some p > 0 and q ≥ 1. We set ϵ² = n⁻¹, where n can be regarded as the sample size, and consider the asymptotic performance of the generalization error as n grows. For the stopping time, we choose the oracle one that minimizes the generalization error for each method. (A hedged code sketch of this protocol follows the table.) |
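
The experiment-setup row above describes a simulation protocol rather than a full implementation. The sketch below is not the authors' released code; it assumes a Gaussian sequence model with polynomial-decay coefficients θ_k = k^(-p), noise level ϵ² = 1/n, truncation at N terms, and plain discrete-time gradient descent on the squared loss started from zero (standing in for the gradient flows of equations (22) and (30), whose exact parameterizations are not reproduced here), with the oracle stopping time chosen to minimize the generalization error. All variable names and constant values are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): discrete-time gradient
# descent approximating a gradient flow on a truncated sequence model.
# Assumptions (illustrative only): signal theta_k = k^{-p}, observations
# y_k = theta_k + eps * xi_k with eps^2 = 1/n, squared loss, and an oracle
# stopping time chosen to minimize the generalization error.
import numpy as np

rng = np.random.default_rng(0)

N = 2000          # truncation level of the sequence model
p = 1.0           # polynomial decay exponent of the signal (assumed)
n = 1000          # "sample size"; noise level eps^2 = 1/n
eps = 1.0 / np.sqrt(n)
step = 1e-3       # small step size approximating the gradient flow
n_steps = 20000

k = np.arange(1, N + 1)
theta = k ** (-p)                          # true coefficients (illustrative)
y = theta + eps * rng.standard_normal(N)   # noisy sequence observations

# Gradient descent on the truncated least-squares objective
# L(beta) = 0.5 * ||y - beta||^2, started from zero; early stopping acts
# as the regularizer, and we record the oracle (error-minimizing) time.
beta = np.zeros(N)
best_err, best_step = np.inf, 0
for t in range(1, n_steps + 1):
    grad = beta - y
    beta -= step * grad
    gen_err = np.sum((beta - theta) ** 2)  # generalization error vs. truth
    if gen_err < best_err:
        best_err, best_step = gen_err, t

print(f"oracle stopping step: {best_step}, generalization error: {best_err:.4e}")
```

To compare methods as the paper does, the update inside the loop would be replaced by the corresponding over-parameterized gradient flow of equation (30), with the same truncation, oracle-stopping protocol, and tracking of the generalization error as n grows.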