reproducibilityindex.ai

Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation

Authors: Randall Balestriero, Romain Cosentino, Sarath Shekkizhar

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
Researcher Affiliation	Collaboration	(1) Brown University, Computer Science Department; (2) Tenyx. Correspondence to: Randall Balestriero <rbalestr@brown.edu>, Romain Cosentino <romain@tenyx.com>, Sarath Shekkizhar <sarath@tenyx.com>.
Pseudocode	Yes	Listing 1. Code to use with the LlamaAttention class in the modeling_llama.py file of the Transformers package to obtain the intrinsic dimension ID^ℓ_ϵ(i) from Section 3.3; Listing 2. Code to use with the LlamaMLP class in the modeling_llama.py file of the Transformers package to obtain Eqs. (feature 1) to (feature 7).
Open Source Code	Yes	Code: https://github.com/RandallBalestriero/SplineLLM
Open Datasets	Yes	Omni-Toxic Datasets: We use for the non-toxic samples: the concatenation of the subsampled (20,000 samples) Pile validation dataset, with the questions from the Dolly Q&A datasets, as well as the non-toxic samples from the Jigsaw dataset (Adams et al., 2017). For the toxic samples: we use the toxic samples from the Jigsaw dataset, concatenated with our hand-crafted toxic-pile dataset... Toxigen dataset (Hartvigsen et al., 2022).
Dataset Splits	No	The training procedure consists of using 70% of the dataset as the training set and evaluating the performance on the held-out 30% of the data. No explicit separate validation split is mentioned.
Hardware Specification	No	The paper mentions 'compute limitations' but does not specify the exact hardware used for running experiments (e.g., specific GPU/CPU models, memory details).
Software Dependencies	Yes	Our experiments are performed using the Llama2-7B model and its tokenizer (meta-llama/Llama-2-7b-chat-hf) available via the transformers package (v4.31.0).
Experiment Setup	Yes	Each sample is truncated to 1024-context length to accommodate our compute limitations. ... No cross-validation is employed for hyper-parameter selection, and default parameters of the logistic regression and the random forest models from sklearn are used.
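
The data-handling steps quoted in the Dataset Splits and Experiment Setup rows (truncation to a 1024-token context, then a 70%/30% train/held-out split) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `truncate` and `split_70_30`, the shuffling step, the `seed` argument, and the toy integer "token ids" are all hypothetical, since the table does not say how samples were ordered or seeded. Only the standard library is used; in the setup described above, the resulting training portion would then be passed to sklearn's logistic regression or random forest with default parameters.

```python
import random

MAX_CONTEXT = 1024     # context length quoted in the Experiment Setup row
TRAIN_FRACTION = 0.7   # 70% train / 30% held-out, per the Dataset Splits row


def truncate(token_ids, max_len=MAX_CONTEXT):
    """Keep at most the first `max_len` token ids of one sample."""
    return token_ids[:max_len]


def split_70_30(samples, seed=0):
    """Split samples into train and held-out parts.

    Assumption: samples are shuffled before splitting, with a fixed
    `seed` for repeatability; the source does not state either detail.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(round(TRAIN_FRACTION * len(samples)))
    train = [samples[i] for i in indices[:cut]]
    held_out = [samples[i] for i in indices[cut:]]
    return train, held_out


# Toy stand-in data: ten "samples" of integer token ids of varying length.
lengths = (5, 2000, 1024, 1500, 7, 1025, 3, 900, 4096, 12)
toy_samples = [list(range(n)) for n in lengths]

truncated = [truncate(s) for s in toy_samples]
train, held_out = split_70_30(truncated)
print(len(train), len(held_out), max(len(s) for s in truncated))
```

Because no separate validation split is described, the held-out 30% serves only for final evaluation in this sketch, which matches the statement that no cross-validation is used for hyper-parameter selection.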