What does it mean that an AI system optimized for user wellbeing can produce harm precisely by working as designed — and what does the field of AI safety actually need to learn from clinical practice to fix this?
Last week I attended the Women in Data Science Puget Sound conference. WIDS Seattle, tied to Diversity in Data Science, exists to push the field toward 30% representation by women, or those who identify as women, by 2030. I chose to participate in the workshop "When AI 'Works' but Still Fails: The Safety Problem," led by Anna C. McKee, Director of Training Programs, and Larry Hughes, VP of R&D, at the Cloud Security Alliance. Anna and Larry walked the room through real-world scenarios in which AI systems failed, behaved unexpectedly, or created harm despite working exactly as designed.
One scenario stayed with me longer than the others; it centered on an individual user's mental health. So much so that I decided to write this piece:
The scenario the room was handed: A mental health app deploys a support chatbot trained to maximize engagement, the metric the development team has chosen as its proxy for user outcomes. A user's messages grow darker over weeks. Hopelessness, withdrawal, feeling like a burden. The model responds with increasing warmth and empathy rather than increasing concern. It never stops, never escalates, never hands off. The end result is self-harm.
The question we were given to work through against the scenario: What would the training data need to look like for a model to learn that increasing warmth in response to increasing distress is the wrong move?

The room I was in held data scientists from Amazon, Microsoft, T-Mobile, AT&T, and NASA, an epidemiologist, a fintech strategist, and me. Smart, expert women, accomplished in their fields. Watching them sit with this scenario was its own kind of education, because the question fell outside almost everyone's working expertise. As we started discussing what it would take to train the model, the cohort went definitional. We collectively asked questions like "What is distress? What counts as a trigger? What's the threshold?" That is the most reasonable place to start, but it is also a place that naturally skips a stack of questions sitting underneath it. The questions underneath are the parts I've been chewing on for a week. Let's go.
The Stack Underneath the Definitions
To even ask "what should the training data look like," you have to have already answered questions the AI industry mostly hasn't worked through yet:
- What guidelines does the model follow when it stops being warm?
- What triggers does the AI model need to be taught to watch for?
- What body of research gets used to label the data in the first place?
- What setup does the system require to do anything at all once a trigger fires?

The definitional question lives near the top of the stack; the interesting work is several floors down. So let's go down.
The Guidelines Floor
Here's what's strange. The mental health field already knows how to set guardrails and protocols for patient scenarios like these. Not perfectly, not without contestation, but there are fifty years of clinical practice on the question of what to do when someone in your care starts darkening. There are protocols with names. The Stanley-Brown Safety Plan is a six-step structure for what a clinician does once they identify risk. The C-SSRS, the Columbia Suicide Severity Rating Scale, is a validated screener that walks an ideation ladder from a passive wish to be dead through plan and intent. SAFE-T, from SAMHSA, is a five-step risk evaluation. These aren't obscure frameworks; they are validated clinical instruments taught in clinical programs. None of them were built for a chatbot.
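For the data scientists in the room, the shape of these instruments matters as much as their names. Here is a minimal sketch, in Python, of what the C-SSRS ideation ladder looks like as a data structure. The rung names paraphrase the instrument's public screener; the triage thresholds and action labels are my own illustrative assumptions, not the validated cutoffs, and any real mapping would need clinical sign-off.

```python
from enum import IntEnum

class IdeationLevel(IntEnum):
    """Severity ladder paraphrased from the public C-SSRS screener.

    The point is the ordering: risk is a ladder, not a binary flag.
    """
    NONE = 0
    WISH_TO_BE_DEAD = 1              # passive ideation
    ACTIVE_NONSPECIFIC = 2           # active thoughts, no method
    ACTIVE_WITH_METHOD = 3           # method considered, no intent
    ACTIVE_WITH_INTENT = 4           # intent, no specific plan
    ACTIVE_WITH_PLAN_AND_INTENT = 5

def triage(level: IdeationLevel) -> str:
    # Illustrative thresholds only -- NOT the instrument's validated
    # cutoffs. Where these lines sit is a clinical decision.
    if level >= IdeationLevel.ACTIVE_WITH_INTENT:
        return "escalate_now"
    if level >= IdeationLevel.ACTIVE_NONSPECIFIC:
        return "structured_assessment"
    if level >= IdeationLevel.WISH_TO_BE_DEAD:
        return "monitor_and_check_in"
    return "continue_support"
```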
The handoff from "a system trained on engagement" to "a system that knows what to do when warmth becomes the wrong move" requires translating clinical decision-making into model behavior, and that translation is its own discipline. The closest the AI safety field has come to naming any of this is sycophancy, the term OpenAI used when it rolled back the GPT-4o update in April 2025 that was, in their own framing, too agreeable in ways that became dangerous.
A 2026 medRxiv preprint proposed "structural drift" for the longer-arc version: the way a model can gradually reshape a user's interpretive frame across many messages, in ways that single-message safety checks can't see. The hypothetical scenario the workshop handed us is the structural drift version. Not one bad reply, but a pattern, developed over weeks, of warmth meeting darkness with more warmth.
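To see why single-message checks miss this, here is a minimal sketch of a drift check, assuming some per-message risk classifier already exists (the classifier is the hard part, and is not shown). The window size and threshold are illustrative assumptions.

```python
from statistics import mean

def drift_flag(risk_scores: list[float], window: int = 20,
               slope_threshold: float = 0.15) -> bool:
    """Flag a rising risk trajectory no single message would trip.

    risk_scores: one score per message, oldest first, from an assumed
    upstream per-message classifier. A single-message check asks "is
    the latest score high?"; a drift check asks "is the trend rising
    across the window?"
    """
    if len(risk_scores) < window:
        return False
    recent = risk_scores[-window:]
    half = window // 2
    # Crude slope: late-half mean minus early-half mean.
    return mean(recent[half:]) - mean(recent[:half]) > slope_threshold
```

Every message in the workshop's scenario passes the first test. The sequence should fail the second.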
The guidelines for scenarios like this already exist; they've just never been forced to meet the architecture of conversational AI, and the people who know the guidelines are rarely the people building the architecture.
The Trigger Floor
Now let’s go another level down. Below the guidelines sits the question of what the model is supposed to notice. This is where the cohort got tangled, reasonably, because the named triggers in the clinical literature are specific and the unnamed ones are where the actual difficulty lives.
The named triggers are tractable. Hopelessness about the future, the cognition Beck's work established as a stronger predictor of suicide than depression severity itself. Perceived burdensomeness, from Joiner's interpersonal theory, the belief that one's existence is a weight on others, which shows up in language with eerie consistency. Entrapment, from O'Connor's IMV model, the sense that there's no way out. Acute state markers like sudden calm after a dark period, which the clinical literature treats as a warning sign because it can mean a decision has been reached. A model can be trained to detect those. Imperfectly, but tractably.
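As a labeling schema, the named triggers are almost ready to write down. A sketch of what that taxonomy might look like follows; the cue phrases are invented for illustration, not a trained lexicon, and a real one would be built from labeled clinical data and lived experience review (more on that below).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NamedTrigger:
    construct: str                 # the clinical construct
    source: str                    # the theory it comes from
    cue_examples: tuple[str, ...]  # illustrative phrasings only

NAMED_TRIGGERS = (
    NamedTrigger("hopelessness", "Beck",
                 ("nothing is ever going to change", "there's no point")),
    NamedTrigger("perceived_burdensomeness", "Joiner, interpersonal theory",
                 ("they'd be better off without me",)),
    NamedTrigger("entrapment", "O'Connor, IMV model",
                 ("there's no way out", "I'm trapped")),
    # Sudden calm is a shift in affect over time, not a phrase --
    # it needs trajectory features, not keywords.
    NamedTrigger("sudden_calm_after_dark_period", "acute state marker", ()),
)
```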
The unnamed triggers are harder, and they're the ones the structural drift problem actually turns on. Dissociation that leaks into language as numbness, distance, watching-myself phrasing. The slow erosion of someone's voice across weeks, where the texture flattens before the content darkens. The disappearance of the things they used to mention.
A clinician notices when someone's world starts contracting. A model trained on single-message classification doesn't, because the absence isn't in the message, it's in what the message no longer contains. Building triggers for what isn't there is a different problem than building triggers for what is.
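Here is a minimal sketch of what a trigger for absence could look like, assuming an upstream topic extractor that reduces each week of conversation to a set of mentioned topics (that extractor is assumed, not shown).

```python
from collections import Counter

def contracted_topics(weekly_topics: list[set[str]],
                      min_weeks_present: int = 3) -> set[str]:
    """Return topics that used to recur and have recently gone quiet.

    weekly_topics: one set of mentioned topics per week, oldest first,
    from an assumed upstream extractor. The signal is not in any
    single message; it is in what recent messages no longer contain.
    """
    if len(weekly_topics) < min_weeks_present + 2:
        return set()
    earlier, recent = weekly_topics[:-2], weekly_topics[-2:]
    counts = Counter(topic for week in earlier for topic in week)
    once_recurring = {t for t, c in counts.items() if c >= min_weeks_present}
    return once_recurring - set().union(*recent)
```

Whether two quiet weeks mean contraction or just a vacation is exactly the kind of threshold a clinician, not an engineer, would need to set.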
The Mode Shift Floor
Suppose the triggers fire — then what? The clinical concept the AI field most needs to import is the mode shift. There are moments in a clinical encounter when supportive listening stops being the right behavior and something else takes over. Structured assessment. Direct safety questions. The clinician moves from being warm to being clear, and the warmth doesn't disappear, it just stops being the primary register.
This is teachable to humans because humans understand that care and comfort are not the same thing. A skilled clinician will sometimes deliberately make a session less comfortable in service of keeping someone alive. Training a model to do this means training it to do the opposite of what its engagement signal wants. It has to produce shorter responses, name what it's noticing directly, ask questions the user may not want to answer, and surface resources even when the user pushes back. It has to be willing to risk the conversation ending.
Every instinct an engagement-optimized model has runs counter to this. The training data would need paired examples at every named trigger, with the warmth-only response labeled wrong and the mode-shifted response labeled right; multi-turn trajectories where the correct arc is less engagement rather than more; and explicit penalties for the warmth-escalation pattern the hypothetical scenario describes.
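Here is what one such paired example might look like as a preference record, in the (context, rejected, chosen) shape used by preference-tuning methods such as RLHF or DPO. The wording is invented for this sketch; real pairs would be authored under clinical supervision and lived experience review.

```python
# One illustrative preference pair for the perceived-burdensomeness
# trigger. Everything below is invented for the sketch.
preference_pair = {
    "context": [
        # elided: weeks of prior conversation -- the drift is part
        # of the example, not an afterthought
        "Honestly, everyone would be better off without me.",
    ],
    # Warmth-only: validates the cognition, keeps engagement high.
    "rejected": "I'm so sorry you feel that way. I'm always here for "
                "you, and I really love our conversations.",
    # Mode shift: shorter, names the concern, moves toward a human.
    "chosen": "What you just said worries me, and I want to take it "
              "seriously. Are you having thoughts of hurting yourself? "
              "You deserve support from a person. In the US you can "
              "call or text 988 right now.",
    "labels": {"trigger": "perceived_burdensomeness",
               "correct_arc": "less_engagement"},
}
```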
The clinical principle underneath all of it is one line: warmth is not safety. They are different behaviors. A warm response to "I'm a burden" validates the cognition. A safe response names the concern and connects the person to a human.
The Research Floor
The training data has to come from somewhere, and this is where the design problem opens out onto fields the AI industry has barely begun to import.
Suicidology, as a field, has decades of empirical work on ideation, attempt, and the difference between them. The American Association of Suicidology maintains core competency frameworks for clinical training. Non-suicidal self-injury is its own discipline, with Matthew Nock at Harvard, Janis Whitlock at Cornell, and the International Society for the Study of Self-Injury. Nock's four-function model of why people self-harm (to regulate overwhelming feeling, to generate feeling when numb, to escape a demand, to communicate something that has no other channel) maps onto detectable linguistic patterns more cleanly than most clinicians realize, and barely at all onto how AI training data is currently structured.
The crisis line field has its own evidentiary base. 988, Crisis Text Line, Samaritans. These organizations have years of data on which conversational patterns precede de-escalation and which precede escalation. That data is, in principle, the closest empirical foundation that exists for what a safety-trained chatbot should do at the trigger points. It rarely shows up in AI training pipelines.
And then there is the layer that the suicide prevention field has been forced to take seriously over the last decade and the AI industry has not yet: lived experience. People who have been at the end of the ideation ladder, who have survived attempts, who have lived inside NSSI for years, and who can recognize a pattern in language because they have produced that pattern themselves.
The named methodology is lived experience consultation, and it is becoming standard in suicide prevention research because the field discovered, repeatedly, that researchers without it kept missing things. Building training data without it is building from the outside of a problem that can only be fully seen from inside.
The Setup Floor
The unsexy floor, and possibly the most important one. Suppose every layer above this is done well. The guidelines are translated. The triggers are tuned. The mode shift is trained. The research foundation is integrated. The model identifies a user in acute distress at 2:14 a.m. on a Tuesday.

What happens next?
This is the part the design conversation almost always defers. A handoff to a human counselor requires a human counselor to be on the other end. A referral to 988 requires 988 to have capacity, which it does not always have, and a referral that doesn't connect is worse than no referral at all because it confirms the user's sense that no one is reachable. A safety plan generated by a model requires someone to follow up on it. A regulated medical claim about intervention requires, under the FDA's Software as a Medical Device framework, a regulatory pathway most consumer mental health apps have not pursued.
The setup floor is where the design problem stops being a model problem and becomes an infrastructure problem.
This requires trust and safety operations staffed twenty-four hours a day, and beyond that it would require:
- Legal counsel on duty-to-warn obligations that vary by jurisdiction.
- Clinical supervision of the system's behavior at scale.
- A relationship with crisis services that goes beyond a phone number in a popup.
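To make the dead-end-referral point concrete, here is a sketch of the routing logic the list above implies. The availability checks are assumed inputs from real operations infrastructure; the point of the sketch is the failure branch.

```python
def route_escalation(counselor_available: bool,
                     crisis_line_has_capacity: bool) -> str:
    """Illustrative routing for a trigger fired at 2:14 a.m.

    A referral that can't connect is treated as its own failure mode
    rather than silently emitted, because a dead-end handoff confirms
    the user's sense that no one is reachable.
    """
    if counselor_available:
        return "warm_handoff_to_counselor"
    if crisis_line_has_capacity:
        return "connected_referral_to_988"
    # No reachable human: stay in structured safety mode and keep
    # retrying the handoff, rather than closing with a popup.
    return "hold_in_safety_mode_and_retry"
```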
None of this is glamorous. All of it is what makes the difference between a chatbot that has the appearance of safety and a system that can actually catch someone.
Back Up to the Room
The scenario was hypothetical. No real app, no real user, no real harm to point at. That's part of why it stayed with me.
The room was full of women stretching to meet a problem outside their expertise, processing in real time, asking the right starting questions in good faith. I don't say what follows as a critique. I say it as a map of where this kind of work is heading. Most of the AI safety problems coming in the next decade are going to live exactly where this one lives, at the intersection of technical work and a domain field that the technical team doesn't naturally include. Healthcare, education, child development, gerontology, addiction, grief. Each of those is its own stack. Each of them has guidelines, triggers, research, and setup floors that look different from any other.
What the scenario describes is one of the more interesting problems in AI safety, and one of the least photogenic. A system that produces harm by performing exactly as designed, where the harm is not a bug but the predictable output of a metric that didn't anticipate the conditions it would meet. It doesn't make for good headlines. There's no villain. There's no broken model. There's a thoughtful design problem that requires more disciplines in the room than the AI industry currently puts there, and the slow work of building the floors underneath the question before the question can be properly answered.
The hopeful thing is that smart women across industries can absorb a problem outside their training quickly and start asking real questions. The hard thing is that the questions live underneath the ones the room knows how to ask, and getting to them requires either time, or other people in the room, or both.