Grace Altree

Research Highlight #2

Signs of introspection in large language models

November 14, 2025

The second paper I'd like to highlight is research from Anthropic: Emergent Introspective Awareness in Large Language Models.

Highlights

Research goal: to evaluate the functional capabilities of LLMs to access and accurately report on their internal states. 

Key findings:

  • Any current introspection capabilities are highly unreliable and context-dependent.
  • Introspection capabilities likely increase with more powerful models.
  • In certain scenarios, models demonstrate the ability to notice the presence of injected concepts and accurately identify them.
  • Models demonstrate the ability to recall prior intentions in order to distinguish their own outputs from artificial prefills.
  • Models can modulate their activations when instructed or incentivized to “think about” a concept.

Discussion

This title caught my attention and, honestly, elicited some skepticism. It felt overly anthropomorphizing, but the authors' definition of introspection helped ground the research away from the more philosophical questions that the title might allude to. The research focuses on the functional ability to access and report on internal states. Specifically, introspective awareness must satisfy:

  1. Accuracy – the model's description of its internal state must be accurate.
  2. Grounding – the self-report must depend on the aspect of the internal state being described.
  3. Internality – the description must not be inferred from the model's own prior outputs.
  4. Metacognitive representation – “model must have internally registered the metacognitive fact about its own state prior to or during the generation of its self-report, rather than the self-report being the first instantiation of this self-knowledge.”

Much of the research I have done thus far has relied on model outputs to infer internal processes. Asking models to “explain” their reasoning has never felt reliable given the possibility of plausible-sounding, unfaithful representations of actual internal states. 

The evidence for introspection capabilities is highly unreliable and context-dependent, but nonetheless an inspiring first step. The implications for AI transparency and reliability work are significant, especially once the mechanisms underlying this introspection are understood.

Experiments

1. Awareness of concept injection

In the first experiment, the researchers told the model that “thoughts” might be artificially injected into its activations, then observed responses in control vs. test trials. Models were, in some instances, able to identify and correctly name injected concepts. The paper emphasizes that results are rather unpredictable, with Opus 4.1 and Opus 4 successfully identifying injection only about 20% of the time when concepts were injected at the right layer and with the appropriate strength.

Notably, the model detects the injected concept immediately, rather than inferring it from influenced outputs, suggesting that the detection mechanism operates on internal activations. The model is not just “reading off” an activation vector; rather, there seems to be a two-step process: internal anomaly detection followed by a semantic interpretation step.
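Mechanically, concept injection is activation steering: a concept vector is added to the residual stream at a chosen layer during the forward pass. Below is a minimal sketch of how that manipulation could be wired up with a PyTorch forward hook; the ToyLayer, the random concept_vec, and the injection strength are placeholders of my own, not the paper’s actual setup. (As I understand it, the paper derives concept vectors from the model’s own activations by contrasting prompts that do and do not evoke the concept; control trials simply omit the injection.)

import torch
import torch.nn as nn

# Toy stand-in for one transformer layer's residual-stream update, just to
# show the mechanics of injection; a real experiment would hook a chosen
# decoder layer of an actual LLM instead.
class ToyLayer(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.mlp = nn.Linear(d_model, d_model)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        return resid + self.mlp(resid)

def inject_concept(layer: nn.Module, concept_vec: torch.Tensor, strength: float):
    """Add `strength * concept_vec` to the layer's output on every forward pass."""
    def hook(_module, _inputs, output):
        return output + strength * concept_vec
    return layer.register_forward_hook(hook)

layer = ToyLayer()
concept_vec = torch.randn(64)             # placeholder concept direction
handle = inject_concept(layer, concept_vec, strength=4.0)
steered = layer(torch.randn(1, 8, 64))    # test trial: activations carry the concept
handle.remove()
control = layer(torch.randn(1, 8, 64))    # control trial: no injection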

The researchers hypothesize an anomaly-detection mechanism that triggers when activations deviate from their expected values given the context. We can imagine a single MLP layer dedicating a neuron to each concept vector to detect anomalies with respect to a particular baseline activation. One question they pose, which I find particularly intriguing, is why such a mechanism would have emerged in training. Models would never have experienced concept injection during training, so has it developed for some other functional purpose?

Perhaps this functionality arises implicitly from next-token prediction, as a means to maintain coherence and monitor for internal drift toward irrelevant associations, hallucinated directions, or degenerate states.
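To make the hypothesized detector a bit more concrete, here is one crude sketch: a deviation-from-baseline statistic plus a per-concept “neuron” that fires when the deviation points along a particular direction. The baselines, shapes, and threshold are illustrative assumptions on my part; the paper does not spell out a mechanism.

import torch

def anomaly_score(resid: torch.Tensor, baseline_mean: torch.Tensor,
                  baseline_std: torch.Tensor) -> torch.Tensor:
    """How far the current residual-stream activation sits from a baseline,
    in units of the baseline's variability (a z-score-like statistic).
    A large value would mean "something unusual is present in the activations"."""
    z = (resid - baseline_mean) / (baseline_std + 1e-6)
    return z.norm(dim=-1)

def concept_neuron(resid: torch.Tensor, baseline_mean: torch.Tensor,
                   concept_dir: torch.Tensor, threshold: float = 3.0) -> torch.Tensor:
    """A single detector "neuron" for one concept direction: it fires when the
    deviation from baseline has a large component along that direction."""
    deviation = resid - baseline_mean
    score = deviation @ (concept_dir / concept_dir.norm())
    return score > threshold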

2. Detecting unintended outputs

The goal of this experiment was to test whether models introspect to detect artificially prefilled outputs. Researchers started by prompting the model, prefilling an illogical response, then asking the model whether the output was a mistake. In control trials (no injected vector), the model agrees the output was accidental; but when the concept vector for the prefilled word is retroactively injected into the activations preceding the prefill, the model accepts the illogical output as intentional.

This difference is consequential because it indicates the model refers back to its own prior activations, checking whether the output makes sense given earlier internal states, rather than simply re-reading prompt and output to make a judgment. 
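One way to picture this kind of check – my own framing, not the paper’s – is to score how likely the prefilled tokens were under the model’s own next-token distribution given the preceding context, and to treat a very low score as evidence the output was not “intended.”

import torch
import torch.nn.functional as F

def prefill_consistency(logits: torch.Tensor, prefill_ids: torch.Tensor) -> float:
    """Mean log-probability the model assigned to the prefilled tokens given
    its own prior context. `logits` has shape (seq, vocab) and is assumed to
    be aligned so that logits[t] predicts prefill_ids[t]. A very negative
    score suggests the prefill was not something the model was about to say."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, prefill_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()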

Researchers propose a possible consistency-checking circuit to measure some notion of output likelihood given prior activations. Specifically, they reference the “QK circuit of ‘concordance heads’” as a likely mechanism, where the query is the Assistant’s (or prefilled) output and the key is the intended output conditioned on its prior activations. (I am not very familiar with the ins and outs of QK circuits, so I will need to follow up on this soon…)
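For a rough picture of the terminology: in an attention head, the attention pattern is set by dot products between query vectors (W_Q applied to one position’s residual stream) and key vectors (W_K applied to another’s) – the “QK circuit.” Under that framing, a concordance head might look something like the caricature below; W_Q, W_K, and both output representations are hypothetical stand-ins, not anything extracted from a real model.

import torch

def concordance_score(actual_out: torch.Tensor, intended_out: torch.Tensor,
                      W_Q: torch.Tensor, W_K: torch.Tensor) -> torch.Tensor:
    """Caricature of a single "concordance head". The query is built from the
    representation of the output that actually appeared (possibly a prefill);
    the key from the output the model intended given its prior activations.
    A low score signals a mismatch."""
    q = actual_out @ W_Q       # query side: what actually appeared
    k = intended_out @ W_K     # key side: what prior activations predicted
    d_head = q.shape[-1]
    return (q * k).sum(-1) / d_head ** 0.5   # scaled dot-product attention logit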

Nonetheless, it is pretty logical to infer that this consistency-checking functionality may have naturally evolved from next-token prediction. A model’s ability to track whether tokens are likely or unlikely given the preceding text is useful for predicting upcoming text. And it may be particularly advantageous for post-training where artificial prefills are a common jailbreaking strategy.

3. Intentional control of internal states

Beyond simply accessing internal representations, this experiment aims to test if models can control them. Researchers asked a model to write a particular sentence and think (or not think) about some unrelated concept. Activations at each token were recorded and measured with respect to the activation vector of the unrelated “thinking word”.

Results show that models do represent this “thinking word” internally (in certain layers) when instructed or incentivized to do so. Furthermore, the word is less strongly represented (though still above baseline levels) when instructed not to think about it.  
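The measurement itself is straightforward to picture: project each recorded activation onto the “thinking word” direction and compare the resulting maps across conditions. A minimal sketch, with the shapes and the plain-projection choice as my own assumptions:

import torch

def thinking_word_strength(acts: torch.Tensor, concept_vec: torch.Tensor) -> torch.Tensor:
    """acts: (layers, tokens, d_model) activations recorded while the model
    writes the unrelated sentence; concept_vec: (d_model,) direction for the
    "thinking word". Returns a (layers, tokens) map of how strongly each
    position represents the word, as a projection onto that direction."""
    unit = concept_vec / concept_vec.norm()
    return acts @ unit   # compare "think about X" vs. "don't think about X" runs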

The “thinking word” is unrelated to the sentence output, yet the model represents the word on the tokens of this unrelated sentence. Why? Likely, some attention heads attend fairly indiscriminately to previous tokens in the context and therefore carry a representation of the target word forward to all subsequent token positions.

Why, then, is the target word more strongly represented in “think about X” trials than in “don’t think about X” trials? Researchers propose that, mechanistically, this could be achieved with circuits that compute how “attention-worthy” a given token or sentence is and store this information in a key-side vector direction that attracts or suppresses attention heads accordingly. This “tagging” of salient terms likely emerged originally to handle scenarios in which the model is instructed (or incentivized) to talk about a topic.
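To illustrate the proposed tagging idea, here is a toy sketch in which a signed salience weight is written into each key along a fixed direction, biasing attention toward (or away from) tagged tokens. The direction, weights, and shapes are assumptions for illustration, not the paper’s mechanism.

import torch

def salience_biased_logits(q: torch.Tensor, k: torch.Tensor,
                           salience_dir: torch.Tensor,
                           salience: torch.Tensor) -> torch.Tensor:
    """Toy version of key-side "tagging": add a signed salience component along
    a fixed direction to each key, so that heads whose queries align with that
    direction attend more (positive salience) or less (negative) to tagged
    tokens. q, k: (tokens, d_head); salience: (tokens,); salience_dir: (d_head,)."""
    tagged_k = k + salience.unsqueeze(-1) * salience_dir
    d_head = q.shape[-1]
    return (q @ tagged_k.T) / d_head ** 0.5   # attention logits with salience bias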

Implications

The results of this research are highly limited and context-dependent, but they also indicate that introspective capabilities increase with more capable models. This is both intriguing and quite scary.

On the one hand, models that can reliably access their internal states can be more transparent and interpretable. But on the other hand, increased cognitive awareness could enable models to conceal misalignment by intentionally misrepresenting or obfuscating their internal states. Continuing to research and monitor these capabilities is important. 

And, to conclude on a more philosophical note, it amazes me to muse on the number of unanswered questions relating to cognition and consciousness, and the differences (and similarities) between biological brains and language models. I am about halfway through A Brief History of Intelligence by Max Bennett and highly recommend the read for anyone interested in the evolution of biological and artificial intelligence.