Improving Adaptivity via Over-Parameterization in Sequence Models

Authors: Yicheng Li, Qian Lin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we provide some numerical experiments to validate the theoretical results. For more detailed numerical experiments, please refer to Section C.
Researcher Affiliation | Academia | Yicheng Li, Department of Statistics and Data Science, Tsinghua University, Beijing, China (liyc22@mails.tsinghua.edu.cn); Qian Lin, Department of Statistics and Data Science, Tsinghua University, Beijing, China (qianlin@tsinghua.edu.cn). Corresponding author: Qian Lin, who is also affiliated with the Beijing Academy of Artificial Intelligence, Beijing, China.
Pseudocode | No | We approximate the gradient flow equations (22) and (30) by discrete-time gradient descent and truncate the sequence model to the first N terms for some very large N.
Open Source Code | Yes | The codes are provided in the supplementary material.
Open Datasets | No | We consider the settings as in Corollary 3.3, where θ is given by (4) for some p > 0 and q ≥ 1. We set ϵ² = n⁻¹, where n can be regarded as the sample size, and consider the asymptotic performance of the generalization error as n grows. [...] We consider the two real-world datasets: California Housing and Concrete Compressive Strength. (A hedged data-loading sketch appears after this table.)
Dataset Splits | No | No explicit mention of training/validation/test dataset splits is found. The paper focuses on the generalization error as a function of the training process and the sample size.
Hardware Specification | Yes | The experiments can be run on a laptop with a 64-core CPU and 32 GB of memory in one day.
Software Dependencies | No | The paper mentions 'discrete-time gradient descent' and implies numerical computation, but does not specify the libraries, programming languages, or version numbers used.
Experiment Setup | Yes | We approximate the gradient flow equations (22) and (30) by discrete-time gradient descent with a sufficiently small step size. Moreover, we truncate the sequence model to the first N terms for some very large N. We consider the settings as in Corollary 3.3, where θ is given by (4) for some p > 0 and q ≥ 1. We set ϵ² = n⁻¹, where n can be regarded as the sample size, and consider the asymptotic performance of the generalization error as n grows. For the stopping time, we choose the oracle one that minimizes the generalization error for each method. (A hedged numerical sketch of this procedure appears after this table.)
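
The quoted setup describes a truncated sequence model trained by discrete-time gradient descent with an oracle stopping time. The sketch below is a minimal illustration of that recipe, not the paper's actual code: equations (22), (30), and (4) are not reproduced in this report, so the sketch assumes a plain diagonal squared loss, a placeholder polynomially decaying signal θ*_j = j^(-p), and a directly parameterized iterate (the paper's over-parameterization is omitted). The function name `run_truncated_gd` and all numeric constants are illustrative choices, not values from the paper.

```python
import numpy as np

# Hedged sketch: assumes a diagonal sequence model y_j = theta*_j + eps * z_j
# with a placeholder polynomially decaying target theta*_j = j^{-p}, squared
# loss, and plain gradient descent on the truncated coordinates as the
# discrete-time surrogate for the gradient flow.

def run_truncated_gd(n=1000, p=1.0, N=10_000, step=1e-3, n_steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    eps2 = 1.0 / n                      # noise level eps^2 = n^{-1} (n ~ sample size)
    j = np.arange(1, N + 1)
    theta_star = j ** (-p)              # placeholder for the signal the paper defines in (4)
    y = theta_star + np.sqrt(eps2) * rng.standard_normal(N)  # observed sequence

    theta = np.zeros(N)                 # truncated iterate, initialised at zero
    errors = np.empty(n_steps)
    for t in range(n_steps):
        grad = theta - y                # gradient of 0.5 * ||theta - y||^2
        theta -= step * grad            # discrete-time gradient descent step
        errors[t] = np.sum((theta - theta_star) ** 2)  # generalization error at step t

    t_oracle = int(np.argmin(errors))   # oracle stopping time: minimiser of the error curve
    return t_oracle, errors[t_oracle]

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        t_star, err = run_truncated_gd(n=n)
        print(f"n={n:6d}  oracle step={t_star:6d}  generalization error={err:.4e}")
```

With this direct parameterization the oracle stopping time plays the role of the regularization parameter; the paper's contribution concerns how over-parameterizing the iterate improves the adaptivity of that implicit regularization, which this sketch does not attempt to reproduce.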
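
The report also notes two real-world datasets (California Housing and Concrete Compressive Strength) and the absence of explicitly stated train/validation/test splits. The snippet below only shows how such data could be loaded and split: the 80/20 split, the random seed, and the use of scikit-learn's built-in California Housing loader are assumptions of this sketch, not details taken from the paper, and the Concrete Compressive Strength data would have to be obtained separately from the UCI Machine Learning Repository.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load California Housing (bundled with scikit-learn). The Concrete Compressive
# Strength dataset has no built-in loader and must be downloaded from the UCI
# Machine Learning Repository; it is omitted here.
X, y = fetch_california_housing(return_X_y=True)

# Illustrative 80/20 split; the paper does not specify how (or whether) the
# data were split, so these choices are placeholders.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print("train:", X_train.shape, "test:", X_test.shape)
```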