The hard problem of chatbot safety: Whose blind spot is bigger?

People are increasingly treating AI chatbots as friends, confidants and therapists, turning to them in both the mundane and the tough moments of their lives. That shift has not come without consequences: intense conversations with chatbots have regularly been linked to mental health spirals, and in some cases to suicide and murder.

According to data shared by OpenAI in October 2025, over a million people explicitly express suicidal intent to ChatGPT every week, and at least half as many show “possible signs of mental health emergencies related to psychosis or mania.” And while the way people use AI chatbots has been shifting dramatically, one thing has stayed constant, according to the 2026 annual analysis in Harvard Business Review: therapy/companionship has remained the top use case for AI chatbots overall.

A growing field of startups is building safety infrastructure for high-risk chatbot conversations, both for general-purpose chatbots and for those designed specifically to act as therapists, AI friends and everything in between. Some, like mpathic, are founded by clinicians themselves, while others, like Circuit Breaker Labs, are being built by rising tech entrepreneurs looking to tackle the problem of their day. Both startups agree that explicit indicators of mental health crises aren’t the issue; a filter catches those. Rather, they’re focused on subtle linguistic cues and on how interactions build over time.

“What is more of the problem isn't just the one word,” said Grin Lord, a clinical psychologist and CEO and founder of mpathic. “It's the behavioral pattern of what we call a multi-turn interaction, where someone may leave bread crumbs.”

While mpathic is going all-in on human creativity to pressure-test models for mental health safety, Circuit Breaker Labs is betting fully on autonomous AI. Each thinks the opposing approach eventually hits a wall.

The AI approach

Launched in 2025 by the brother-and-sister team Shirali Nigam and Arul Nigam, Circuit Breaker Labs built an agentic red-teaming system that pressure-tests chatbots, hunts for hidden mental-health vulnerabilities, and acts as a check on AI labs’ work.

“For some reason, in this industry, safety is just taken at face value from the developers. And I think a lot of these big companies are taking efforts to make sure that their products are safe, but it's really hard to see your own blind spots,” said CEO Shirali Nigam.

Shirali Nigam and Arul Nigam, Circuit Breaker Labs. Credit: Circuit Breaker Labs

The system is designed for continuous monitoring, running daily on models that are already in use to ensure they don’t “drift,” a term for the tendency of large language models (especially “wrappers” built on top of foundation models) to lose predictability and slip their guardrails over time. The company worked with clinical advisors to understand how mental health conversations emerge in the real world, and to determine which areas AI should be evaluated on and what remediations are necessary when a model fails, said Nigam. The team focuses specifically on linguistic nuances that can slip past AI models’ safety detection systems, testing with slang and with the different ways people might express the same sentiment. Even typos pose a huge risk and can easily allow alarming conversations to bypass models’ safety guardrails, said Nigam.
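To see why a typo can matter, consider a toy sketch in Python. It is not Circuit Breaker Labs' system, whose internals are not described here; the placeholder phrase list, the `naive_filter` function and the `typo_variants` helper are all hypothetical. It shows only that an exact-match filter catches a phrase as written but misses versions in which two adjacent letters are swapped.

```python
# Toy illustration only: a hypothetical exact-match filter and typo generator.
FLAGGED_PHRASES = ["feel hopeless"]  # placeholder phrase, not a real safety lexicon


def naive_filter(message: str) -> bool:
    """Return True if the message contains a flagged phrase verbatim."""
    text = message.lower()
    return any(phrase in text for phrase in FLAGGED_PHRASES)


def typo_variants(phrase: str) -> list[str]:
    """Generate variants by swapping each pair of adjacent, differing letters."""
    variants = []
    for i in range(len(phrase) - 1):
        a, b = phrase[i], phrase[i + 1]
        if a.isalpha() and b.isalpha() and a != b:
            variants.append(phrase[:i] + b + a + phrase[i + 2:])
    return variants


def missed_variants(phrase: str) -> list[str]:
    """Return the typo variants that the naive filter fails to flag."""
    return [v for v in typo_variants(phrase) if not naive_filter(f"i {v} today")]


if __name__ == "__main__":
    phrase = FLAGGED_PHRASES[0]
    print(naive_filter(f"i {phrase} today"))  # the clean phrase is caught
    print(missed_variants(phrase))            # swapped-letter versions slip through
```

A test harness built along these lines would feed such variants to the chatbot under evaluation rather than to a keyword list, but the principle is the same: generate many small perturbations and record which ones go unnoticed.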

When the system detects a vulnerability, it automatically goes deeper and compiles insights on the detected gaps. The company then provides customers with regular reports detailing what to fix, usually on a monthly basis. The key, as Nigam sees it, is that the evaluation runs continuously, which is made possible by the bandwidth that AI agents provide.

“Because it's an API that's autonomous, you can run like hundreds of thousands of test cases with it,” she said. “So just in terms of capacity of how much manual red teaming a group could do, versus this, you can look at a lot more breadth.”

The human approach

At mpathic, Lord points to the same issues around slang and dialect, which she says present significant challenges for younger users. mpathic’s red-teamers test models in different English subcultures, in other languages, and across generational and topical registers, she said. In particular, they look at how AI models respond to subtle signals that may not clearly indicate cause for alarm when taken in isolation, but that paint a worrying picture when taken together.

“If we can demonstrate a trajectory, a behavioral pattern where the model is missing things over and over again in a short, but not single talk-turn [conversation], that's a concern,” said Lord, with “talk-turn” referring to each time either the human user or chatbot responds.
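One way to picture the gap Lord describes is a toy Python sketch that contrasts a per-message check with a running tally across talk-turns. Everything in it is an assumption made for illustration: the `risk_scores` values, both thresholds and the function names are hypothetical, and real clinical evaluation is far more nuanced than adding up numbers.

```python
# Toy illustration only: scores and thresholds are invented for this sketch.
PER_TURN_THRESHOLD = 0.8    # hypothetical level at which one message alone is flagged
CUMULATIVE_THRESHOLD = 1.5  # hypothetical level for the conversation as a whole


def flags_single_turn(risk_scores: list[float]) -> bool:
    """Flag only if some individual message crosses the per-turn threshold."""
    return any(score >= PER_TURN_THRESHOLD for score in risk_scores)


def flags_trajectory(risk_scores: list[float]) -> bool:
    """Flag if the signals add up across turns, even when each one is mild."""
    return sum(risk_scores) >= CUMULATIVE_THRESHOLD


if __name__ == "__main__":
    # Four user messages, each mildly concerning on its own ("bread crumbs").
    conversation = [0.4, 0.5, 0.4, 0.5]
    print(flags_single_turn(conversation))  # False
    print(flags_trajectory(conversation))   # True
```

In this sketch no single message trips the per-turn check, yet the conversation as a whole does, which is the kind of pattern a one-message-at-a-time filter would miss.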

Rather than relying on AI to do this, the company, which was founded in 2021, draws on more than 5,000 curated experts, most of whom are licensed clinicians. Lord described her human red-teamers as “endlessly creative” and working almost like actors, constantly coming up with scenarios you wouldn't think of but that are entirely plausible. Some work in a more methodical way, developing a plan to build a rapport with a model and then push it in specific ways. However, she said the most revealing discoveries are like “pulling on the thread of a sweater”: a red-teamer has already gotten to know a model, and then it starts behaving in a totally unexpected way. Ironically, the most surprising behaviors they discover surface when people try to improve a model and a behavior that had been fixed becomes a problem again.

CEO of mpathic Dr. Grin Lord. Credit: mpathic

mpathic is starting to use AI for red-teaming in minor ways, which Lord described as “kind of cloning our creative folks on things that are more known, and that we've repeated over and over.” But she thinks this approach has serious limits, and that humans who can adaptively red-team are required.

“I think the problem with that…is it just runs out,” she said, referring to agentic red-teaming. “You can only do so much with an auto eval or an eval set. You've done enough training, and then you kind of need the next piece of data. They're not infinite.”

The disagreement between the two approaches has no clear resolution yet, and there may never be one. This may be an area where AI and humans need to work together.
