Language models can explain neurons in language models

We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

Language models have become more capable and more broadly deployed, but our understanding of how they work internally is still very limited. For example, it might be difficult to detect from their outputs whether they use biased heuristics or engage in deception. Interpretability research aims to uncover additional information by looking inside the model.

One simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing. This has traditionally required humans to manually inspect neurons to figure out what features of the data they represent. This process doesn’t scale well: it’s hard to apply it to neural networks with tens or hundreds of billions of parameters. We propose an automated process that uses GPT-4 to produce and score natural language explanations of neuron behavior and apply it to neurons in another language model.
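To make the idea of scoring an explanation concrete, here is a minimal sketch. It assumes that an explanation is scored by how well activations predicted from the explanation correlate with the neuron's real activations. The keyword-based `simulate_from_keywords` function is a hypothetical stand-in for the GPT-4 step, and the tokens and activation values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def simulate_from_keywords(tokens, keywords):
    """Hypothetical stand-in for the simulation step: predict 1.0 for tokens
    the explanation says the neuron responds to, and 0.0 otherwise."""
    return [1.0 if tok.lower() in keywords else 0.0 for tok in tokens]


# Illustrative data only: a neuron that (we pretend) responds to vehicles.
tokens = ["the", "car", "drove", "past", "a", "truck"]
real_activations = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8]

# Explanation under test: "fires on words for road vehicles".
simulated = simulate_from_keywords(tokens, {"car", "truck"})
score = pearson(real_activations, simulated)
print(f"explanation score: {score:.3f}")
```

A score near 1 means the explanation predicts the neuron's behavior well on this text, and a score near 0 means it predicts almost nothing. In the process described above, a language model rather than a keyword list would produce the simulated activations.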

This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations.
