🔍 Summary:
Researchers at Anthropic have been delving into the inner workings of large language models (LLMs) like Claude, aiming to understand why these AI systems sometimes generate plausible but incorrect answers instead of simply stating “I don’t know.” Their studies reveal that certain neural network “circuits” influence whether Claude attempts an answer or opts to withhold a response.
In their research, Anthropic used sparse autoencoders to identify clusters of artificial neurons that activate in response to specific concepts or entities, such as “Golden Gate Bridge” or “programming errors.” These clusters, referred to as “features,” play a crucial role in how Claude processes and responds to prompts.
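As a rough illustration of the general technique (not Anthropic’s actual code), a sparse autoencoder learns to reconstruct a model’s internal activations through a much wider, mostly inactive hidden layer; each hidden unit then tends to line up with one interpretable “feature.” The dimensions, names, and training details below are assumptions for the sketch.

```python
# Minimal sketch of a sparse autoencoder for decomposing LLM activations
# into candidate "features". Illustrative only; not Anthropic's implementation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> feature activations
        self.decoder = nn.Linear(d_features, d_model)   # features -> reconstructed activations
        self.l1_coeff = l1_coeff

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))        # non-negative feature activations
        reconstruction = self.decoder(features)
        recon_loss = (reconstruction - activations).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().mean()   # L1 penalty keeps most features off
        return features, recon_loss + sparsity_loss

# Toy usage: 512-dim activations decomposed into 4096 candidate features.
sae = SparseAutoencoder(d_model=512, d_features=4096)
fake_activations = torch.randn(8, 512)   # stand-in for activations captured from an LLM
features, loss = sae(fake_activations)
loss.backward()                          # in practice, trained over a large activation dataset
```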
The findings show that when Claude recognizes a “known entity” such as “Michael Jordan,” related features activate and suppress a default “can’t answer” circuit, prompting the model to provide an answer. Conversely, unfamiliar entities leave the “can’t answer” circuit active, leading Claude to decline to respond. The system isn’t foolproof, however: misidentifying an entity, or recognizing a name while holding only partial knowledge about it, can cause Claude to “hallucinate” answers, as seen when it fabricates information about non-existent papers or individuals.
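A conceptual toy model of that gating behavior might look like the following; the feature names, threshold, and scoring are invented for illustration and are not Claude’s actual mechanism.

```python
# Toy model of the "known entity" vs. "can't answer" gating described above.
# Purely conceptual; values and names are invented for illustration.
def decide_response(known_entity_activation: float,
                    refusal_threshold: float = 0.5) -> str:
    # The "can't answer" circuit is treated as on by default; strong
    # "known entity" features suppress it, letting the model attempt an answer.
    cant_answer_strength = 1.0 - known_entity_activation
    if cant_answer_strength > refusal_threshold:
        return "decline: 'I don't know'"
    return "attempt an answer (risking confabulation if knowledge is only partial)"

print(decide_response(known_entity_activation=0.9))  # familiar name -> answers
print(decide_response(known_entity_activation=0.1))  # unfamiliar name -> declines
```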
Anthropic’s research also shows that artificially increasing or decreasing the weight given to these features can manipulate Claude’s response behavior, sometimes producing confident but incorrect answers about fictional subjects. This insight into the decision-making process of LLMs like Claude is a step toward mitigating AI confabulation, although the researchers acknowledge that their methods currently only scratch the surface of the model’s complex computational processes.
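Interventions of this kind are often described as “feature steering”: adding a feature’s direction into a layer’s activations to amplify or suppress it. The sketch below is a hypothetical illustration using a PyTorch forward hook; the layer, scale, and feature vector are stand-ins, not the actual model or feature.

```python
# Hypothetical sketch of feature steering via a forward hook.
# The hook target, scale, and feature direction are assumptions for illustration.
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        # Push the layer's output along the chosen feature direction.
        return output + scale * feature_direction
    return hook

# Toy stand-in for one transformer layer's output projection (d_model = 512).
layer = torch.nn.Linear(512, 512)
known_answer_direction = torch.randn(512)   # would come from the SAE decoder in practice
handle = layer.register_forward_hook(
    make_steering_hook(known_answer_direction, scale=5.0)
)

steered = layer(torch.randn(1, 512))        # activations now biased toward the steered feature
handle.remove()                             # detach the hook after the experiment
```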
This ongoing research is crucial for improving the reliability of LLM responses, particularly in scenarios where the AI must distinguish between well-understood topics and those it knows less about. Anthropic hopes that further exploration will lead to more sophisticated methods for controlling and understanding the intricate network of decisions within LLMs.
📌 Source: https://arstechnica.com/ai/2025/03/why-do-llms-make-stuff-up-new-research-peers-under-the-hood/