Nehaveigur

Cleaning up AIs: The benefits of mechanistic interpretability

Probably the biggest technical problem the AI field faces is that, at a fundamental level, we don’t understand what AIs do. Frontier AIs are black boxes. They consist of a trillion or so parameters. When you give a model an input, it breaks the input down into tokens and converts each token to a numerical vector, or embedding. The model then transforms those vectors to produce the output. While the output is usually useful and makes sense, we have little idea how the internal representations relate to human-interpretable concepts. As I have written previously, a model’s internal representations are typically both fractured, meaning the same concept is encoded redundantly in separate, disconnected parts of the network rather than in one shared representation, and entangled, meaning unrelated concepts get mixed into the same neurons or directions, so that manipulating one inadvertently affects the others. Some entanglement may be unavoidable, since there are more concepts than independent “neurons” to store them in. Ken Stanley refers to this as fractured entangled representation. A less technical way to put it is that AIs resemble hairballs.
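The "more concepts than neurons" point can be seen with a toy geometric sketch (the dimensions and counts below are arbitrary illustration values, not drawn from any real model). In a high-dimensional space, far more random directions fit than there are dimensions, because random directions are *nearly* orthogonal; the small residual overlap between them is one way to picture entanglement:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100   # number of "neurons" (dimensions) -- arbitrary toy value
n = 1000  # number of concepts, far more than dimensions

# Give each concept a random unit direction in d-dimensional space.
concepts = rng.standard_normal((n, d))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct concept directions.
sims = concepts @ concepts.T
off_diag = sims[~np.eye(n, dtype=bool)]

# Random high-dimensional directions are nearly orthogonal, so 1000
# concepts can coexist in 100 dimensions -- but no pair is *exactly*
# orthogonal, so nudging one direction leaks slightly into the rest.
print("mean |cosine overlap|:", np.abs(off_diag).mean())
```

The mean overlap comes out small but nonzero, which is the tension mechanistic interpretability has to work with: concepts can be packed densely, but never perfectly independently.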

I have a robot vacuum that does a good job cleaning and mopping the floor. Hairballs, however, are its nemesis, and every few days I have to remove the ones that get stuck in its rollers. Hairballs are a tricky problem. Disentangling the hairballs inside AIs is a very active area of research called mechanistic interpretability. Understanding how AIs think would solve at least three problems.

First, by knowing what goes on inside an AI, we would know how far to trust it. This has important implications for AI safety. Trust derived from understanding what goes on inside a model would be superior to trust gained merely because the model hits certain benchmarks.

Second, understanding how AIs think can help us design better AIs: it could make the training process more efficient and amplify their capabilities.

Third, understanding how AIs think may help us understand the explanations they come up with. In other words, we may be able to extract real scientific understanding about how the world works from AIs if we understand how they arrive at their answers. Goodfire, an AI startup working on that problem, has published several examples of this approach. Most recently, they interpreted the internal representations of a foundation model trained on DNA sequences to derive mechanistic explanations for how specific genetic variants cause disease.

There are reasons to think the hairball problem is tractable. In fact, the hairball analogy isn’t entirely accurate. Even though we don’t understand what goes on inside AIs, we know from comparing vector geometries that similar concepts are encoded close to each other in embedding space. Fractured entanglement is real, but the degree of fracture is limited. This suggests that it may be possible to put the fractured concepts back together.
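The claim that similar concepts sit close together in embedding space is usually checked with cosine similarity. A minimal sketch, using hand-invented 4-dimensional vectors purely for illustration (real embeddings have thousands of dimensions and are learned, not written by hand):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- values invented so that the two animal concepts
# point in roughly the same direction and the vehicle does not.
cat = np.array([0.9, 0.8, 0.1, 0.0])
dog = np.array([0.8, 0.9, 0.2, 0.1])
car = np.array([0.1, 0.0, 0.9, 0.8])

# Related concepts lie closer in the vector geometry.
assert cosine(cat, dog) > cosine(cat, car)
```

Comparisons like this, run at scale on real model activations, are what let researchers say the fracture is limited: if the geometry already groups related concepts, stitching fractured pieces back together has something to grab onto.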