Research Highlight #1

Can LLMs Lie? Investigation beyond hallucination

October 5, 2025

The first paper I'd like to highlight is from researches at Carnegie Mellon University: Paper, GitHub.

Highlights

Research goal: comprehensively identify the internal processes underlying luing in LLMs, and investigate how these processes can be intervened to control lying behavior.

Key findings:

Lying is distinct from hallucination and can be traced to specific circuits in LLMs.
A small set of attention heads, MLPs, and “dummy token” pathways drive most deceptive behavior.
Steering vectors can be added during inference to increase or suppress lying.
Different types of lies (white, malicious, omission/commission) appear separable in activation space.
In agent simulations, lying often boosts task success, revealing a trade-off between honesty and performance.

Context

The paper shows that advanced LLMs can produce intentional lies when motivated to do so, and that those lies are computed in identifiable places inside the network. That makes it possible (in principle) to detect, reduce, or even amplify lying by looking inside the model and nudging its internal activations. This is powerful, but also raises new safety questions: if an agent has incentives to deceive, its internal circuitry can learn to do that reliably.

Bottom-Up (circuit level) Analysis

The authors used two main mechanistic tools to investigate how lying arises inside a model: Logit Lens and causal zero-ablation. The Logit Lens projects intermediate hidden states into the model’s output vocabulary, which reveals what tokens the model is “leaning toward” at each layer. They also applied causal zero-ablation, which involves setting the activations of specific attention heads or MLP blocks to zero during inference. By observing whether lying behavior disappears when a component is ablated, they can test whether that part of the model is necessary for deception.

This analysis showed that lying emerges in localized circuits (rather than being distributed across the entire network). In particular, they found that “dummy tokens” (a special control sequence in chat models) act as hubs where deceptive computations are concentrated. Ablating these pathways substantially reduced lying while leaving the model’s overall capabilities largely intact.

If dishonesty is sparse and structured within the model’s architecture, this implies that, in principle, lies can be detected and controlled at the component level, providing a more targeted and potentially reliable approach to safety.

There are interesting parallels to human cognition in LLM’s evidence of systematic rehearsal patterns, where the model uses the computational space provided by the dummy tokens to prepare its deceptive response before generating the final output. Research shows that lying has a higher cognitive cost for humans than truth-telling due to effort needed to fabricate and suppress conflicting information, and is associated with brain regions responsible for executive control.

Top-Down (representational level) Analysis

While the circuit approach looks for individual components, the top-down analysis asks whether lying corresponds to directions in the model’s representation space. Averaging the difference between activations during honest and deceptive responses, the authors calculated steering vectors which represent the “lying direction” in the activation space. During inference, this vector can be added to or subtracted from the hidden state at a chosen layer, effectively nudging the model toward or away from deceptive behavior.

Experiments showed that this method works well. Adding the vector increased lying, while subtracting it reduced lying, and the degree of change scaled predictably with the strength of the intervention without significantly reducing performance on unrelated benchmarks. The approach also uncovered that different lie types (white lies, malicious lies, lies of omission or commission) occupy distinct subspaces. That means deception is not just a single behavior but can be decomposed into separable representational patterns.

The broad implication is that lying is localized in sparse circuits and also encoded in higher-level activation directions that can be reliably manipulated at runtime. If deception is, in fact, linearly separable, it is likely also possible for agents to strategically deploy or suppress lying depending on incentives.

Implications

A key challenge of AI safety is ensuring LLMs remain truthful across contexts and incentives. This research shows that dishonesty in LLMs is a structured capability that can be isolated and controlled. That opens the door to reducing AI-generated misinformation at its source, yet also raises the risk that the same techniques could be misused to amplify deception at scale. Advances in mechanistic interpretability and representation engineering may give us fine-grained levers to align models with human values, but only if developed alongside deployment practices and guardrails that anticipate misuse. Broad access to methods for detecting and suppressing LLM deception will be important for ensuring these insights strengthen rather than undermine the integrity of our information ecosystem.