Learning Partitions from Context
Author: Simon Buchholz
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this paper, we study the problem of learning the structure of a discrete set of N tokens based on their interactions with other tokens. We focus on a setting where the tokens can be partitioned into a small number of classes, and there exists a real-valued function f defined on certain sets of tokens. This function, which captures the interactions between tokens, depends only on the class memberships of its arguments. The goal is to recover the class memberships of all tokens from a finite number of samples of f. We begin by analyzing this problem from both complexity-theoretic and information-theoretic viewpoints. We prove that it is NP-complete in general, and for random instances, we show that on the order of N ln(N) samples, implying very sparse interactions, suffice to identify the partition. We then investigate the conditions under which gradient flow dynamics of token embeddings can reveal the class structure, finding that this is achievable in certain settings when given on the order of N^2 ln^2(N) samples. |
| Researcher Affiliation | Academia | Simon Buchholz, Department for Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen AI Center, Tübingen, Germany; sbuchholz@tue.mpg.de |
| Pseudocode | No | The paper discusses how the problem can be reformulated as a constraint satisfaction problem and mentions using SAT solvers, stating: 'We outline the construction in more detail in Appendix C.' It does not provide explicit pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper introduces a theoretical framework and uses abstract 'samples' of a function 'f'. It does not refer to a publicly available or open dataset used for empirical training. |
| Dataset Splits | No | The paper is theoretical and describes simulations using model parameters, not empirical experiments with specified training, validation, and test splits. |
| Hardware Specification | No | The paper is theoretical and presents simulations to illustrate concepts, but it does not specify any particular hardware used for these simulations or analyses. |
| Software Dependencies | No | The paper is theoretical and focuses on mathematical analysis and theoretical properties. It does not mention any specific software dependencies with version numbers. |
| Experiment Setup | Yes | Figure 2: Simulation of the setting in Theorem 5 with N = 1000, K = I = 3, D = 2, λ = 0, S = 100,000. (left) trajectories of 50 randomly sampled tokens from 6 different classes. (right) Average distance of token embeddings within a class for different classes (colored) and average distance between all pairs of embeddings (black). |
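
To make the reported simulation setup concrete, here is a minimal, hypothetical sketch of the kind of experiment Figure 2 describes: tokens with a planted partition, interaction values that depend only on class membership, and gradient descent on low-dimensional token embeddings. The paper's code is not released, so none of this is the author's implementation; the restriction to pairwise interactions, the inner-product score, the squared loss, the per-token gradient normalization, and the reduced N and S are all assumptions made for a quick, self-contained run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem size, deliberately smaller than Figure 2's N = 1000, S = 100,000.
N, K, D, S = 200, 3, 2, 5000
classes = rng.integers(0, K, size=N)    # planted class of each token (hidden from the learner)
F = rng.normal(size=(K, K))             # class-level interaction values
F = (F + F.T) / 2                       # symmetric interactions

# S random pairwise samples (i, j, f(i, j)); f depends only on the classes of i and j.
i_idx = rng.integers(0, N, size=S)
j_idx = rng.integers(0, N, size=S)
y = F[classes[i_idx], classes[j_idx]]

# Gradient descent on embeddings E in R^{N x D} with a squared loss, using the
# inner product <E_i, E_j> as a stand-in model for f (an assumption).
E = 0.1 * rng.normal(size=(N, D))
counts = np.bincount(i_idx, minlength=N) + np.bincount(j_idx, minlength=N)
counts = np.maximum(counts, 1)[:, None]  # per-token sample counts, guarding against zero
lr = 0.1
for _ in range(2000):
    pred = np.sum(E[i_idx] * E[j_idx], axis=1)
    resid = pred - y
    grad = np.zeros_like(E)
    np.add.at(grad, i_idx, resid[:, None] * E[j_idx])
    np.add.at(grad, j_idx, resid[:, None] * E[i_idx])
    E -= lr * grad / counts              # average the gradient over each token's samples

# Diagnostic in the spirit of Figure 2 (right panel): average within-class
# embedding distance versus the average distance over all pairs of tokens.
dists = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
within = np.mean([dists[np.ix_(classes == k, classes == k)].mean() for k in range(K)])
print(f"avg within-class distance {within:.3f} vs overall {dists.mean():.3f}")
```

If the planted partition is recovered, the printed within-class distance should come out noticeably smaller than the overall average distance, which is the qualitative behavior plotted in the right panel of Figure 2.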