Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Density estimation with LLMs: a geometric investigation of in-context learning trajectories
Authors: Toni Liu, Nicolas Boullé, Raphaël Sarfati, Christopher Earls
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate LLMs' ability to perform density estimation (DE), which involves estimating the probability density function (PDF) from data observed in-context. Our core experiment is remarkably straightforward. As illustrated in Figure 1, we prompt LLMs such as LLaMA-2 (Touvron et al., 2023), Gemma (Gemma Team et al., 2024), and Mistral (Jiang et al., 2023) with a series of data points {X_i}_{i=1}^n sampled independently and identically from an underlying distribution p(x). We then observe that the LLM's predicted PDF, p̂_n(x), for the next data point gradually converges to the ground truth as the context length n (the number of in-context data points) increases. |
| Researcher Affiliation | Academia | Toni J.B. Liu, Department of Physics, Cornell University, USA (EMAIL); Raphaël Sarfati, School of Civil and Environmental Engineering, Cornell University, USA (EMAIL); Nicolas Boullé, Department of Mathematics, Imperial College London, UK (EMAIL); Christopher J. Earls, Center for Applied Mathematics, School of Civil and Environmental Engineering, Cornell University, USA (EMAIL) |
| Pseudocode | No | The paper describes methods and steps in prose (e.g., "Our methodology consists of 5 steps..."), but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codebase, along with a 3D visualization of an LLM's in-context learning trajectory, is publicly available at https://github.com/AntonioLiu97/LLMICL_inPCA. |
| Open Datasets | No | We investigate LLMs' ability to perform density estimation (DE), which involves estimating the probability density function (PDF) from data observed in-context. Our core experiment is remarkably straightforward. As illustrated in Figure 1, we prompt LLMs such as LLaMA-2 (Touvron et al., 2023), Gemma (Gemma Team et al., 2024), and Mistral (Jiang et al., 2023) with a series of data points {X_i}_{i=1}^n sampled independently and identically from an underlying distribution p(x). The paper details how target distributions are created (Gaussian, Uniform, and randomly generated via Gaussian Processes in Appendix A.9), but does not provide specific links, DOIs, or repositories for the sampled datasets used in the experiments. |
| Dataset Splits | No | The paper investigates in-context learning by varying the number of data points (context length 'n') provided to the LLM for density estimation, rather than using predefined training, validation, and test splits from a static dataset. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'SciPy' for numerical optimization but does not provide its version number or any other specific software dependencies with their versions. |
| Experiment Setup | Yes | As illustrated in Figure 1, we prompt LLMs such as LLaMA-2 (Touvron et al., 2023), Gemma (Gemma Team et al., 2024), and Mistral (Jiang et al., 2023) with a series of data points {X_i}_{i=1}^n sampled independently and identically from an underlying distribution p(x). We set α = 1, effectively populating each bin with one "hallucinated" data point prior to observing any data (Jeffreys, 1946). Unless otherwise noted, we use C = 1 for classical KDE in this paper. For a given DE trajectory p̂_1(x), …, p̂_n(x), we optimize our bespoke KDE to minimize the Hellinger distance at each context length i: min_{s_i ∈ (0, ∞), h_i ∈ (0, ∞)} D_Hel(p̂_i(x) ‖ p̂_{h_i, s_i}(x)). While LLaMA-2 has a context window of 4096 tokens (equivalent to 1365 comma-delimited, 2-digit data points), we limit our analysis to a context length of n = 200. |
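The Experiment Setup row describes fitting a bespoke KDE to each predicted PDF by minimizing the Hellinger distance, with a Dirichlet prior of α = 1 on the histogram bins. A minimal sketch of that optimization in Python, assuming a plain Gaussian kernel (the paper's bespoke kernel also carries a shape parameter s, fixed here for simplicity) and a prior-smoothed histogram standing in for the LLM's predicted PDF p̂_i(x):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def hellinger(p, q, dx):
    """Discretized Hellinger distance between two densities on a shared grid."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

def gaussian_kde_pdf(x_grid, data, h):
    """Gaussian KDE with bandwidth h, evaluated at the grid points x_grid."""
    z = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def prior_smoothed_hist(data, edges, alpha=1.0):
    """Histogram density with alpha 'hallucinated' points per bin (Dirichlet prior)."""
    counts, _ = np.histogram(data, bins=edges)
    dx = edges[1] - edges[0]
    probs = (counts + alpha) / (counts.sum() + alpha * len(counts))
    return probs / dx  # normalize so the density integrates to 1

# Hypothetical stand-in for p_hat_i(x): n = 200 i.i.d. samples in [0, 1),
# binned over 100 bins (mirroring 2-digit data) with alpha = 1.
data = rng.normal(0.55, 0.1, size=200)
edges = np.linspace(0.0, 1.0, 101)
centers = 0.5 * (edges[:-1] + edges[1:])
dx = edges[1] - edges[0]
p_hat = prior_smoothed_hist(data, edges, alpha=1.0)

# Fit the KDE bandwidth by minimizing the Hellinger distance to p_hat,
# analogous to the paper's per-context-length optimization over (h_i, s_i).
res = minimize(
    lambda h: hellinger(p_hat, gaussian_kde_pdf(centers, data, h[0]), dx),
    x0=[0.05],
    bounds=[(1e-3, 1.0)],
)
print(f"best-fit bandwidth h = {res.x[0]:.3f}")
```

All names here (`prior_smoothed_hist`, the grid, the sample distribution) are illustrative assumptions, not the authors' implementation; the paper additionally optimizes the kernel shape s jointly with h.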