Language models can explain neurons in language models

language models can explain neurons in language model

We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

Language models have become more capable and more broadly deployed, but our understanding of how they work internally is still very limited. For example, it might be difficult to detect from their outputs whether they use biased heuristics or engage in deception. Interpretability research aims to uncover additional information by looking inside the model.

One simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing. This has traditionally required humans to manually⁠(opens in a new window) inspect⁠ neurons⁠(opens in a new window) to figure out what features of the data they represent. This process doesn’t scale well: it’s hard to apply it to neural networks with tens or hundreds of billions of parameters. We propose an automated process that uses GPT-4 to produce and score natural language explanations of neuron behavior and apply it to neurons in another language model.

This work is part of the third pillar of our approach to alignment research⁠: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations.