Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond

Authors: Yingcong Li, Ankit Rawat, Samet Oymak

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results corroborate our theoretical findings. Overall, this work explores the optimization and risk landscape of ICL in practically meaningful settings and contributes to a more thorough understanding of its mechanics. We now conduct synthetic experiments to support our theoretical findings and further explore the behavior of different models of interest under different conditions.
Researcher Affiliation | Collaboration | Yingcong Li (University of Michigan, yingcong@umich.edu); Ankit Singh Rawat (Google Research NYC, ankitsrawat@google.com); Samet Oymak (University of Michigan, oymak@umich.edu)
Pseudocode | No | The paper describes mathematical derivations and model architectures but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The NeurIPS checklist states 'Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: As discussed above, this paper conducts small scale synthetic experiments to corroborate our theoretical findings. We have provided sufficient details to reproduce these experiments in Section 4.'
Open Datasets | No | We consider meta-learning setting where task parameter β is randomly generated for each training sequence. ... We generate data according to (7) with Σ_x = Σ_β = I_d and σ = 0...
Dataset Splits | No | In all experiments, we set the dimension d = 20. Depending on the in-context length (n), different models are trained to make in-context predictions. We train each model for 10000 iterations with batch size 128 and Adam optimizer with learning rate 10^-3. Since our study focuses on the optimization landscape, and experiments are implemented via gradient descent, we repeat 20 model trainings from different initializations and results are presented as the minimal test risk among those 20 trials.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models. It only mentions 'Our work only focuses on 1-layer attention/H3 model training with hidden dimension 21 and maximal context length < 100, which can be implemented easily.'
Software Dependencies | No | We train each model for 10000 iterations with batch size 128 and Adam optimizer with learning rate 10^-3.
Experiment Setup | Yes | We train each model for 10000 iterations with batch size 128 and Adam optimizer with learning rate 10^-3. (A code sketch of this setup follows the table.)
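
The data generation, model, and training details quoted above (Open Datasets, Dataset Splits, Hardware Specification, and Experiment Setup rows) can be pieced together into a short script. The following is a minimal sketch under stated assumptions, not the authors' code: it draws a fresh task parameter β ~ N(0, I_d) per sequence with Σ_x = Σ_β = I_d and σ = 0, uses a single-layer linear-attention predictor over tokens of dimension d + 1 = 21 (the exact attention/H3 parameterization is not given in the quotes and is assumed here), and trains with Adam at learning rate 10^-3, batch size 128, and 10000 iterations, reporting the minimal test risk over independently initialized restarts. The in-context length n and the number of restarts shown are illustrative.

# Minimal sketch (not the authors' released code) of the synthetic ICL experiment
# described above: meta-learning linear regression with Sigma_x = Sigma_beta = I_d,
# sigma = 0, d = 20, a single-layer linear-attention predictor (an assumed stand-in
# for the paper's attention/H3 models), Adam with lr 1e-3, batch 128, 10000 iterations,
# and best-of-restarts test risk.

import torch

d, n = 20, 40                     # feature dimension (paper: d = 20); n is illustrative
batch, iters, lr = 128, 10_000, 1e-3

def sample_batch(batch, n, d, sigma=0.0):
    """Draw a fresh task beta ~ N(0, I_d) per sequence and n+1 examples per task."""
    beta = torch.randn(batch, d, 1)                      # Sigma_beta = I_d
    x = torch.randn(batch, n + 1, d)                     # Sigma_x = I_d
    y = (x @ beta).squeeze(-1) + sigma * torch.randn(batch, n + 1)
    return x, y

class OneLayerLinearAttention(torch.nn.Module):
    """Single-layer linear attention over tokens z_i = [x_i; y_i] (token dim d + 1 = 21).

    This parameterization is an assumption for illustration; the paper's 1-layer
    attention and H3 models may be parameterized differently.
    """
    def __init__(self, dim):
        super().__init__()
        self.WQ = torch.nn.Linear(dim, dim, bias=False)
        self.WK = torch.nn.Linear(dim, dim, bias=False)
        self.WV = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, z_ctx, z_query):
        # z_ctx: (B, n, dim) context tokens; z_query: (B, 1, dim) query token
        attn = self.WQ(z_query) @ self.WK(z_ctx).transpose(1, 2)   # (B, 1, n)
        out = attn @ self.WV(z_ctx)                                # (B, 1, dim)
        return out[:, 0, -1]                                       # prediction read off last coordinate

def train_once(seed):
    torch.manual_seed(seed)
    model = OneLayerLinearAttention(d + 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(iters):
        x, y = sample_batch(batch, n, d)
        z_ctx = torch.cat([x[:, :n], y[:, :n, None]], dim=-1)              # [x_i; y_i] tokens
        z_query = torch.cat([x[:, n:], torch.zeros(batch, 1, 1)], dim=-1)  # query label masked to 0
        loss = ((model(z_ctx, z_query) - y[:, n]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # test risk on a fresh batch of tasks
    with torch.no_grad():
        x, y = sample_batch(batch, n, d)
        z_ctx = torch.cat([x[:, :n], y[:, :n, None]], dim=-1)
        z_query = torch.cat([x[:, n:], torch.zeros(batch, 1, 1)], dim=-1)
        return ((model(z_ctx, z_query) - y[:, n]) ** 2).mean().item()

# Report the minimal test risk over independently initialized trainings
# (the paper uses 20 restarts; 3 here keeps the sketch cheap to run).
best_risk = min(train_once(s) for s in range(3))
print(f"best test risk: {best_risk:.4f}")

The best-of-restarts reporting mirrors the quoted rationale: since the study focuses on the optimization landscape and training is done by gradient-based descent, individual runs can end in suboptimal stationary points, so the minimal test risk across restarts is used as the reported result.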