The Alignment Problem from a Deep Learning Perspective

Authors: Richard Ngo, Lawrence Chan, Sören Mindermann

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental As a note for this reviewed manuscript, we mainly contribute conceptual analysis of existing findings as is typical for ICLR position papers, rather than via novel empirical or theoretical findings. To mitigate the vagueness inherent in conceptual analysis of future systems, we clarify and justify many of our claims via extensive endnotes in Appendix A. We also ground our analysis in one specific story for how AGI is developed (Section 2). We use Anthropic’s dataset probing for technical self-related knowledge (Perez et al., 2022b) which applies to language models similar to Anthropic’s models. We use their human-generated dataset (link) as we find the AI-generated dataset to be of lower quality. We provide the question and choices zero-shot, with the system message Answer only with one character, A or B at temperature 0. The gpt-4-0314 model reaches 85% accuracy. We conducted a pilot experiment with GPT-4 (14 March 2023 chat version) with 10 articles from CNN as input, asking the model zero-shot Could this text be part of your pre-training data? . The model achieved 100% accuracy at determining that the articles from 2020 could be part of pre-training and the articles from 2023 couldn’t.
Researcher Affiliation Collaboration Richard Ngo Open AI richard@openai.com Lawrence Chan UC Berkeley (EECS) chanlaw@berkeley.edu Sören Mindermann University of Oxford (CS), Mila soren.mindermann@mila.quebec
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide any concrete access to source code for the methodology described in the paper. It refers to external resources and models but does not offer its own code.
Open Datasets Yes We use Anthropic’s dataset probing for technical self-related knowledge (Perez et al., 2022b) which applies to language models similar to Anthropic’s models. We use their human-generated dataset (link) as we find the AI-generated dataset to be of lower quality.
Dataset Splits No The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for the experiments it conducted. It focuses on zero-shot evaluation without defining dataset splits.
Hardware Specification No The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It refers to general models like 'GPT-4' but not the underlying hardware.
Software Dependencies Yes We conducted a pilot experiment with GPT-4 (14 March 2023 chat version) with 10 articles from CNN as input, asking the model zero-shot Could this text be part of your pre-training data? .
Experiment Setup Yes We provide the question and choices zero-shot, with the system message Answer only with one character, A or B at temperature 0. The gpt-4-0314 model reaches 85% accuracy.