A Language Model’s Guide Through Latent Space

Authors: Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: "Our extensive experiments reveal that while some concepts such as truthfulness more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor either remain difficult to elicit, need extensive tuning to work, or even experience confusion." It further reports: "We perform an extensive series of experiments and uncover that (1) some concepts such as truthfulness are very robustly guidable, in agreement with prior work, but (2) novel concepts such as humor need extensive tuning for guidance to be successful while (3) appropriateness remains impossible to elicit, resulting in concept confusion with compliance. We further observe that probes with optimal detection accuracies do not necessarily make for the optimal guides, contradicting previous observations for truthfulness." (A sketch of the guidance technique appears below the table.)
Researcher Affiliation | Academia | Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann (Data Analytics Lab, Department of Computer Science, ETH Zurich).
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. Figures are present, but they are not pseudocode.
Open Source Code | No | The paper states: "We plan to release these models publicly, which we hope will greatly decrease the cost of running evaluations on this benchmark." This indicates a planned future release rather than concrete, current access to the source code for the described methodology.
Open Datasets | Yes | "We leverage annotated datasets from Open Assistant (Köpf et al., 2023) and further contribute a synthetic dataset of our own for appropriateness. Our appropriateness dataset is based on real user prompts from ToxicChat (Lin et al., 2023)."
Dataset Splits | No | "For every concept C, we use a probing dataset consisting of 512 balanced samples, which we split into a training set D_train and a test set D_test using a 75/25 split." This describes only train and test splits; no separate validation split or its proportion is mentioned. (A sketch of the split appears below the table.)
Hardware Specification | No | The paper provides no specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments. It mentions models such as Llama-2-7b-chat and Mistral-7B-instruct, but these are software models, not hardware specifications.
Software Dependencies | No | The paper mentions models such as Mistral-7B-v0.1 and Llama-2-7b-chat and notes the use of Jsoup in one example, but it does not provide version numbers for the programming languages, frameworks, or key libraries used in the experimental setup.
Experiment Setup | Yes | "Namely, we apply guidance to the top k ∈ {8, 16, 24, 32} layers (determined by training accuracy) and with 31 log-spaced guidance strengths α ∈ [-128, 128] for Llama-2 and α ∈ [-512, 512] for Mistral." (A sketch of one possible guidance-strength schedule appears below the table.)
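
The "guidance" assessed above refers to the paper's technique of steering a model by shifting its hidden activations along a concept direction found by a linear probe. Below is a minimal sketch of that general recipe, assuming additive steering along a unit-norm probe direction; the function and variable names are illustrative, not taken from the authors' code.

```python
import torch

def guide_hidden_state(h: torch.Tensor, w_c: torch.Tensor, alpha: float) -> torch.Tensor:
    """Shift hidden states h of shape (batch, seq, d_model) along the
    concept direction w_c of shape (d_model,) by guidance strength alpha.

    Hypothetical sketch: alpha > 0 is assumed to steer generation toward
    the concept, alpha < 0 away from it.
    """
    direction = w_c / w_c.norm()   # normalize the probe direction
    return h + alpha * direction   # broadcast over batch and sequence dims
```

In practice such a hook would be applied to the residual stream of the selected top-k layers during generation.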
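For the Dataset Splits row, here is a minimal sketch of the stated 75/25 split of 512 balanced samples per concept, assuming a stratified split via scikit-learn; the feature dimensionality, random seed, and placeholder data are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for one concept's probing dataset: 512 balanced samples.
X = np.random.randn(512, 4096)   # placeholder hidden-state features
y = np.repeat([0, 1], 256)       # 256 positive / 256 negative labels

# 75/25 train/test split as stated in the paper; stratify preserves balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(len(X_train), len(X_test))  # 384 128
```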
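For the Experiment Setup row, one plausible construction of the 31 log-spaced guidance strengths in [-128, 128] (or [-512, 512] for Mistral) is 15 log-spaced positive magnitudes, their mirrored negatives, and zero. The smallest magnitude `alpha_min` and the inclusion of zero are assumptions; the quoted setup does not pin them down.

```python
import numpy as np

def guidance_strengths(alpha_max: float, n: int = 31, alpha_min: float = 1.0) -> np.ndarray:
    """Return n guidance strengths in [-alpha_max, alpha_max]:
    (n - 1) // 2 log-spaced positive magnitudes, their negatives, and 0.
    alpha_min is an assumed smallest nonzero magnitude."""
    half = (n - 1) // 2
    pos = np.logspace(np.log10(alpha_min), np.log10(alpha_max), num=half)
    return np.concatenate([-pos[::-1], [0.0], pos])

alphas_llama = guidance_strengths(128.0)    # 31 values for Llama-2
alphas_mistral = guidance_strengths(512.0)  # 31 values for Mistral
```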