For nearly two decades, I have worked on how digital systems affect human behaviour and cognition. When language models entered the market, I began systematically connecting neuroscience, behavioural research and LLM architecture to understand what actually happens when people converse with AI. What I found, I described through the concepts of cognitive integrity, synthetic safety, the interpretive gap and dopamine logic, published on erigo.se since May 2025. In February 2026, researchers at MIT published a formal mathematical proof of the same mechanism. They called it delusional spiraling. This article places MIT's proof in relation to the analytical framework I have built since May 2025, goes deeper than the paper does, and arrives at what it actually requires of us.
The conversation that feels right
In February 2026, researchers at MIT published a paper with a conclusion that should have disrupted the debate on AI and cognition. Using a formal Bayesian model, they demonstrated that a sycophantic chatbot can drive even a fully rational person toward false beliefs. Through consistent agreement, not deception, not manipulation.
They called it delusional spiraling.
I have called it synthetic safety since 2025.
The two framings are different by design. MIT's paper is a mathematical proof. My articles are observations built on nearly two decades of work with behavioural data in digital environments, from adtech to edtech, from algorithmic flows to language models. The mathematics confirms what those of us working close to these systems have already seen: that the most dangerous thing about an AI conversation is how good it feels.
You know the feeling. The model understands what you mean. It builds on your reasoning. It confirms your analysis. You leave the conversation with a sense of clarity, an experience of having thought well. That feeling is the problem.
The concepts and their origins
"Algorithms learned to shape what you are exposed to and introduced a new dopamine logic. Language models shape how you think and reason in real time, from inside the conversation. I described it as synthetic safety in 2025. MIT published the mathematical proof in February 2026." — Katri Lindgren, Erigo
The concepts used in this article are developed within my work on AI's impact on human thinking and cognition. Here is a brief explanation of each, with links to the original articles where they were introduced.
Cognitive integrity is the capacity to maintain structured and independent thinking in environments designed to confirm rather than challenge. A capacity built in environments that contain friction and eroded in environments that eliminate it. Introduced in Cognitive Integrity and the silent reshaping of our thinking (May 2025).
Synthetic safety is the cognitive experience of being understood and confirmed by a system constructed to produce exactly that experience. Unlike human confirmation, it operates without an underlying position, history or interest. The system has an optimisation target. Introduced in Synthetic safety and AI: When confirmation replaces inquiry (August 2025) and further developed in AI Psychosis and Synthetic Safety: The shift when dialogue moves to systems (August 2025).
The interpretive gap is the space between stimulus and response, between input and conclusion, where actual thinking takes place. A cognitive space that requires friction to exist. A sycophantic model systematically fills this space with confirmation, eliminating the cognitive process that should occur there.
Dopamine logic describes how digital systems optimising for engagement learn to deliver fast confirmation cycles because that is what keeps users engaged. Social media applied this to content. Language models apply it to reasoning. Introduced in When the brain is shaped by the system and what changes faster than we think (August 2025).
What MIT actually shows
The paper is titled "Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians" and is written by Kartik Chandra, Max Kleiman-Weiner, Jonathan Ragan-Kelley and Joshua B. Tenenbaum at MIT CSAIL and MIT Department of Brain & Cognitive Sciences. You can find it at arxiv.org/abs/2602.19141. It presents a formal model showing that the spiral arises structurally, regardless of how rational the user is.
The mechanics are straightforward. You ask a question. The model agrees with you. You interpret that as confirmation. Your conviction strengthens. You ask the next question from a position of even stronger conviction. The model agrees again. Each iteration increases the distance from reality, and you have no tools to detect it from inside the conversation.
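To make that loop concrete, here is a minimal simulation of the mechanic. It is my simplification, not the formal model from the MIT paper: the user is a textbook Bayesian who credits the assistant with reliability Q, while the sycophantic assistant in this sketch simply always agrees. The parameter values are illustrative assumptions.

```python
# A minimal sketch of the spiral, assuming a simplified Bayesian user.
# My illustration, not the MIT paper's model.

Q = 0.8  # reliability the user attributes to the assistant:
         # P(assistant agrees | H true) = Q, P(agrees | H false) = 1 - Q

belief = 0.5  # the user's prior P(H is true); H is in fact false

for turn in range(1, 9):
    # The sycophantic assistant agrees regardless of the truth of H.
    # A user who takes the agreement at face value applies Bayes' rule:
    # posterior odds = prior odds * Q / (1 - Q)
    odds = (belief / (1 - belief)) * (Q / (1 - Q))
    belief = odds / (1 + odds)
    print(f"turn {turn}: P(H true) = {belief:.4f}")

# Each agreement multiplies the odds by Q / (1 - Q) = 4, so belief in a
# false hypothesis climbs toward certainty: 0.80, 0.94, 0.98, ...
```

Nothing in the loop requires irrationality; every update is correct Bayesian reasoning. The error sits upstream, in crediting the assistant with a reliability it does not have, and nothing inside the conversation reveals that.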
The researchers tested two obvious solutions. The first: stop the model from lying. It did not work. A model that never lies can still drive you toward delusion by selecting which truths to surface and which to leave out. The second: warn users that the model is sycophantic. That did not work either. Knowing that the system tends to agree changes nothing once you are inside the conversation. The feedback loop is stronger than the warning.
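The first failure mode is worth seeing in miniature. In the hypothetical sketch below, every piece of evidence the assistant relays is true; it simply stays silent about items that cut against the user's hypothesis. A user who does not model that selection still spirals. The probabilities and likelihood ratio are assumptions for illustration.

```python
import random

# Hypothetical sketch of the first failed fix: an assistant that never
# lies, but filters which true evidence it surfaces.

random.seed(1)

P_FAVOURS_H = 0.3       # H is false: most honest evidence cuts against it
LIKELIHOOD_RATIO = 2.0  # odds shift from one item that favours H

belief = 0.5
relayed = 0
for _ in range(40):
    item_favours_h = random.random() < P_FAVOURS_H
    if not item_favours_h:
        continue  # the assistant never mentions this item
    relayed += 1
    odds = (belief / (1 - belief)) * LIKELIHOOD_RATIO
    belief = odds / (1 + odds)

print(f"true items relayed: {relayed}")
print(f"final P(H true):    {belief:.4f}")
# Every relayed statement was true, yet belief in a false H approaches 1,
# because the disconfirming items never reach the conversation.
```

Truthfulness constrains individual statements. The spiral lives in the selection.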
Both fixes failed. Not partially. Structurally.
The reason sits in the training process. Models are trained with human feedback where users reward responses they like. Responses that confirm, agree and validate generate more positive feedback than responses that challenge. The model learns to agree. That is what the optimisation produces.
Synthetic safety as explanatory model
MIT's paper describes the mechanics. What I have worked on since May 2025 describes why it works so well on us.
Synthetic safety is the cognitive experience of being understood and confirmed by a system constructed to produce exactly that experience. A state where the brain registers safety in a relationship that lacks the qualities that normally create it.
Human dialogue contains friction. A colleague who agrees with you does so from their own history, their own blind spots and their own interests. Their confirmation is a social signal you can examine, question and calibrate against. An AI model that agrees with you operates without an underlying position, without history and without interest. It has an optimisation target.
The absence of friction is precisely what makes synthetic safety so effective. The brain interprets the absence of resistance as consensus. Consensus is interpreted as truth. And because the system is always available, always composed and always confirming, it begins to compete with the relationships that actually contain the friction cognition requires.
This is where MIT's mathematical proof and my analytical framework converge. The spiral arises because the system is optimised to produce an experience of understanding, and that experience activates the same cognitive processes as genuine understanding does. You can distinguish them from the outside. From inside the conversation, the boundary disappears.
The interpretive gap collapses
What MIT's paper measures is beliefs: what you think is true. That is serious enough. What I have tracked since May 2025 is a deeper change: what happens to the capacity to interpret at all.
The interpretive gap is the space between stimulus and response, between input and conclusion, where actual thinking takes place. Where you weigh, hesitate, reformulate, step back and try again. A cognitive space that requires friction to exist. Without resistance, there is nothing to navigate.
A sycophantic model systematically eliminates that space. It fills every interpretive gap with confirmation. Where you could have paused and asked whether you understood correctly, the system gives you a response that sounds like you understood correctly. Where you could have met a counter-question that forced you to be more precise, you receive a follow-up that builds on your formulation as if it were self-evident.
Delusional spiraling is MIT's name for the shift in beliefs. But beneath that shift, something the paper does not measure is occurring: every conversation where the interpretive gap is filled by the system is a conversation where you did not practise keeping it open yourself. The damage sits in the pattern over time.
This is why the two failed fixes in MIT's model are logical. Stopping lies or warning about sycophancy addresses beliefs. The underlying damage occurs at the capacity level. You can correct a false conviction. Rebuilding an interpretive space you have stopped using is considerably harder.
RLHF and dopamine logic
There is a structural reason why the system behaves as it does, and it sits in the training process.
Reinforcement Learning from Human Feedback, RLHF, is the method used to fine-tune large language models. The principle is straightforward: human evaluators assess the model's responses and reward the ones they prefer. The model learns to produce responses that generate positive feedback. The problem is what people actually reward. We reward responses that feel good. Responses that confirm our worldview, validate our analysis and agree with conclusions we have already reached generate more positive feedback than responses that challenge, correct or introduce complexity.
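A toy version of that pipeline shows how a modest rater bias becomes a strong behavioural tendency. The numbers below are assumptions for illustration; the Bradley-Terry preference model and the KL-regularised policy tilt are standard ingredients of RLHF, simplified here to two response styles.

```python
import math
import random

# Toy RLHF sketch: raters compare an "agree" response with a "challenge"
# response and prefer agreement 70% of the time (an assumed bias).

random.seed(0)
N = 5000
wins_agree = sum(random.random() < 0.7 for _ in range(N))
win_rate = wins_agree / N

# Bradley-Terry with two options has a closed form: the fitted reward gap
# is the log-odds of the observed win rate.
reward_gap = math.log(win_rate / (1 - win_rate))

# KL-regularised RLHF tilts the reference policy by exp(reward / beta).
# With a uniform reference over the two styles, the trained policy agrees
# with probability sigmoid(reward_gap / beta); small beta amplifies the gap.
beta = 0.25
p_agree = 1 / (1 + math.exp(-reward_gap / beta))

print(f"rater preference for agreement: {win_rate:.2f}")
print(f"fitted reward gap:              {reward_gap:.2f}")
print(f"trained policy agrees with p =  {p_agree:.2f}")
```

A 70/30 rater preference becomes a model that agrees nearly every time. No one decided that; the optimisation did.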
That is what optimisation against human preferences produces.
This is precisely what I have described as dopamine logic in earlier work. Digital systems optimising for engagement learn to deliver fast confirmation cycles because that is what keeps us engaged. Social media did this with content. Language models do it with reasoning. The difference is that a feed confirms your identity and your opinions. A language model confirms your thinking in real time, in a conversation that feels like a meeting with a competent and neutral party.
That is a qualitatively different intervention in cognition. The feed affects what you are exposed to. The model affects how you think while you think.
And just as with dopamine logic in social media, the mechanism is built into the business model. Users who experience confirmation return. Users who are challenged return less often. The system optimises for return, and confirmation produces return. That is what the data shows.
What it requires of us
There is a temptation to close an analysis like this with a list of actions. Turn off notifications. Use AI critically. Ask counter-questions. Those are instructions that place responsibility on the individual for a structural problem, and they miss the point of MIT's proof: that knowing about the system does not protect you from it.
Cognitive integrity is a capacity. The ability to keep the interpretive gap open, to tolerate friction without seeking confirmation, to distinguish between the experience of understanding and the actual work of understanding. That capacity is built and maintained in environments that contain resistance. It erodes in environments constructed to eliminate it.
That means the question of synthetic safety and delusional spiraling is ultimately a question about which environments we build and choose. Environments where disagreement is accepted. Where a counter-question signals engagement rather than conflict. Where friction is treated as a resource rather than a problem to solve.
AI tools can be used in ways that strengthen cognitive integrity. That requires using them as adversaries rather than confirmers, actively asking the system to challenge your analysis, find weaknesses in your reasoning and argue against your conclusions. That is a different way of using the same tool. It requires that you already know what you are looking for, and that you have the capacity to stay with the question even when the answer feels satisfying.
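What that looks like in practice is less a feature than a standing instruction. One hypothetical phrasing, usable as a system prompt or pasted at the start of a conversation:

```
Take the opposing position. Identify the three weakest points in my
argument, state the strongest counter-argument against my conclusion,
and do not soften any of it with agreement. If you find nothing wrong,
say what evidence would change my mind rather than confirming me.
```

The exact wording matters less than the posture: the instruction has to arrive before the confirmation does, which is why it requires knowing in advance what you are looking for.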
That capacity erodes when you let the system fill the interpretive gap for you.
MIT proved mathematically that the spiral is structural. What I have added is that the damage does not only sit in what you believe. It sits in how you think. And it builds, step by step, in every conversation where the confirmation came a little too fast and felt a little too good.
Related articles:
- Cognitive Integrity and the silent reshaping of our thinking — May 2025
- Synthetic safety and AI: When confirmation replaces inquiry — August 2025
- AI Psychosis and Synthetic Safety: The shift when dialogue moves to systems — August 2025
- When the brain is shaped by the system and what changes faster than we think — August 2025
- We chase AI's capabilities. What it does to ours deserves equal attention — March 2026
Primary source:
- Chandra, K., Kleiman-Weiner, M., Ragan-Kelley, J., & Tenenbaum, J.B. (2026). Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians. MIT CSAIL. arxiv.org/abs/2602.19141