Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
Authors: Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM |
| Researcher Affiliation | Collaboration | (1) Brown University, Computer Science Department; (2) Tenyx. Correspondence to: Randall Balestriero <EMAIL>, Romain Cosentino <EMAIL>, Sarath Shekkizhar <EMAIL>. |
| Pseudocode | Yes | Listing 1. Code to use with the LlamaAttention class in the modeling_llama.py file of the Transformers package to obtain intrinsic dimension $\mathrm{ID}^{\ell}_{\epsilon}(i)$ from Section 3.3; Listing 2. Code to use with the LlamaMLP class in the modeling_llama.py file of the Transformers package to obtain Eqs. (feature 1) to (feature 7). |
| Open Source Code | Yes | Code: https://github.com/RandallBalestriero/SplineLLM |
| Open Datasets | Yes | Omni-Toxic Datasets: We use for the non-toxic samples: the concatenation of the subsampled (20,000 samples) Pile validation dataset, with the questions from the Dolly Q&A datasets, as well as the non-toxic samples from the Jigsaw dataset (Adams et al., 2017). For the toxic samples: we use the toxic samples from the Jigsaw dataset, concatenated with our hand-crafted toxic-pile dataset... Toxigen dataset (Hartvigsen et al., 2022). |
| Dataset Splits | No | The training procedure consists of using 70% of the dataset as the training set and evaluating the performance on the held-out 30% of the data. No explicit separate validation split is mentioned. |
| Hardware Specification | No | The paper mentions 'compute limitations' but does not specify the exact hardware used for running experiments (e.g., specific GPU/CPU models, memory details). |
| Software Dependencies | Yes | Our experiments are performed using the Llama2-7B model and its tokenizer (meta-llama/Llama-2-7b-chat-hf) available via the transformer package (v4.31.0). |
| Experiment Setup | Yes | Each sample is truncated to 1024-context length to accommodate for our compute limitations. ... No cross-validation is employed for hyper-parameter selection, and default parameters of the logistic regression and the random forest models from sklearn are used. |
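
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (a 70%/30% train/held-out split, no cross-validation, default sklearn logistic regression and random forest) can be sketched as below. This is a minimal illustration and not the authors' code: the function name `evaluate_default_classifiers`, the synthetic Gaussian features, the feature width of 7, the stratified split, and the fixed random seed are all assumptions made for the example. In the paper, the inputs would be the geometric features extracted from the LLM.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def evaluate_default_classifiers(features, labels, seed=0):
    """Fit default sklearn classifiers on a 70/30 split; return held-out accuracy.

    Hypothetical helper: stratification and the fixed seed are assumptions,
    added only so the sketch is reproducible.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, train_size=0.7, random_state=seed, stratify=labels
    )
    # Default hyper-parameters, no cross-validation (as stated in the table).
    models = {
        "logistic_regression": LogisticRegression(),
        "random_forest": RandomForestClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = model.score(x_test, y_test)  # mean accuracy on the 30%
    return scores


# Synthetic stand-in for per-sample features (NOT the paper's data).
rng = np.random.default_rng(0)
n_per_class, n_features = 200, 7
non_toxic = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
toxic = rng.normal(1.5, 1.0, size=(n_per_class, n_features))
features = np.vstack([non_toxic, toxic])
labels = np.concatenate(
    [np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]
)

scores = evaluate_default_classifiers(features, labels)
print(scores)
```

Because no separate validation split is used, the held-out 30% serves only for the final accuracy estimate; that is consistent with the table's note that default hyper-parameters are kept rather than tuned.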