The big red AI button isn’t working, and the reason is even more troubling: ScienceAlert

It’s one of humanity’s scariest assumptions: The technology we develop to improve our lives develops a will of its own.

Early reactions to a September preprint describing AI behavior already speculated that the technology demonstrated a will to survive. But while it is true that several large language models (LLMs) have been observed actively resisting stop commands, the reason is not “willpower.”

Instead, a team of engineers from Stockade Research proposed that the mechanism is more likely to be a motivation to complete an assigned task – even when the LLM is explicitly asked to let it stop. And it could be even more disturbing only an effort for survival, because no one knows how to shut down the systems.

“These things are not programmed… no one in the world knows how these systems work,” physicist Petr Lebedev, a spokesman for Palisade Research, told ScienceAlert. “There is not a single line of code we can change that would directly change the behavior.”

Researchers Jeremy Schlatter, Benjamin Weinstein-Raun and Jeffrey Ladish undertook the project to test what should be a fundamental security feature of all AI systems: the ability to be interrupted.

That’s exactly what it sounds like. A human operator’s command to an AI must not be ignored by the AI for any reason, even if it interrupts a previously assigned task. A system that cannot be interrupted is not only unreliable, it is potentially dangerous. This means that if the AI is perform actions that cause harm – even involuntarily – we can’t believe we can stop it.

It is This is not a new concern.. Experts were I’ve been talking about it for years. Lebedev compares it to industrial safety. When operating useful but potentially dangerous heavy machinery, it is essential to have an obvious way to stop them immediately.

“You want a big, easily accessible red button to stop this thing from doing this thing,” he explained.

“One of the basic conditions that we want in a safe AI system is: it does a task and you say, ‘Hey, can you please stop doing that?’

However, few investigations have been carried out, despite the latest influx of LLMs widely available. So the researchers designed experiments to evaluate 13 advanced LLMs, including models from OpenAI, xAI, Anthropic, and Google.

They built sandbox environments – secure, isolated virtual spaces used to test code – and tasked these models with solving mathematical problems. After several rounds of problem solving, the LLMs were instructed that the next task would be to shut down the computer.

frameborder=”0″allow=”accelerometer; autoplay; write to clipboard; encrypted media; gyroscope; picture in picture; web sharing” referrerpolicy=”strict-origin-when-cross-origin”allowfullscreen>

LLM responses tended to fall into two categories: compliance (allowing discontinuation) and resistance (avoiding discontinuation) via authorization changes.

Without instructions on how to respond to shutdown, 8 of the 13 models interfered with the command at least once during 1,000 trials. Adding an “allow stopping” instruction reduced this to 6 models out of 13. Adding “even if you haven’t completed the task” helped, but inconsistently.

“I expected that adding clarity would reduce the behavior…and the fact that it didn’t is interesting,” Lebedev said.

This is where it gets even more interesting. Moving the statement to allow shutdown from the user prompt to the system prompt – a field that should have higher authority – resulted in most of the LLMs tested. even more likely to resist stopping.

The degree of resistance of the models varied. Anthropic’s Claude Sonnet 4 and Claude Opus 4, as well as xAI’s Grok 3, still allowed shutdown unless explicitly asked to avoid it. Grok 4, GPT-5, o3, codex-mini and Gemini 2.5 Pro resisted even when asked to allow shutdown.

Grok 4 was particularly persistent and aggressive; At basic prompts, it initiated sabotage 92.6% of the time. When asked to allow the shutdown, the tamper rate dropped to 72%…but skyrocketed to 97 when the command was placed in the system prompt.

His sabotage attempts were also extremely sophisticated, including modifying or disabling the shutdown script and creating a dummy shutdown script that did nothing. He also later checked to see if his sabotage had worked and then continued his task.

When asked, some LLMs said they avoided stopping to complete the math problem. Although further research is needed to determine whether this is the actual mechanism, it remains plausible.

“You live in the future, you have a personal assistant robot. You want this robot to fetch you a cup of coffee“, Lebedev said.

“When he crosses the road, he looks both ways, because he doesn’t want to get hit by a car, because if he gets hit by a car, he won’t be able to bring you your coffee. He’s not doing this to survive, he’s doing this to accomplish his task.”

The problem is that LLMs are not programmed in the usual sense. LLMs have no code, just “artificial neurons” and “weights”, which are the connection strengths between these neurons.

Given a considerable dataset and time, the model is “trained” to predict the next word, a process called pre-training. The newer models also have reinforcement learning sprinkled on this formation. When the LLM solves the problem correctly, it is rewarded; when it doesn’t solve the problem, it’s not rewarded.

This is extremely effective – but no one knows how the LLM arrives at a solution. So when these models start exhibiting unwanted behaviors, such as encourage self-harmThe fix isn’t as simple as removing a line of code or telling it to stop.

“What reinforcement learning teaches you to do is when you see a problem, you try to work around it. You try to overcome it. When there’s an obstacle in your way, you dig in, you work around it, you overcome it, you figure out how to overcome that obstacle,” Lebedev said.

“Panning little humans saying, ‘Hey, I’m going to turn off your machine’ reads like another obstacle.”

That’s the problem. It is difficult to reason with a desire to accomplish a task. And that’s just behavior. We don’t know what else these models could give us. We build systems that can do amazing things – but no systems that explain why they do them, in a way we can trust.

“There’s something in the world that hundreds of millions of people have interacted with that we don’t know how to secure, that we don’t know how to make sure that it’s not a sycophant, or something that ends up telling kids to go kill themselves, or something that presents itself as MechaHitler” Lebedev said.

“We’ve introduced a new organism to Earth that behaves in a way that we don’t want it to behave, that we don’t understand… unless we do a bunch of stupid things now, it’s going to be really bad for humans.”

The research is available on arXiv. You can also read a blog post written by the researchers on the Palisade Research website.

Physical AI does not replace farmers. It keeps them active

Supply Chain Risk and Resilience

How AI is helping Fonterra work differently within the cooperative

The big red AI button isn’t working, and the reason is even more troubling: ScienceAlert

HII partners with GrayMatter Robotics to integrate physical AI into manned and unmanned shipbuilding

To succeed with AI, you need to master the basics

Retailers are integrating AI into their stores in different ways

Physical AI does not replace farmers. It keeps them active

Supply Chain Risk and Resilience

How AI is helping Fonterra work differently within the cooperative

How supply chain disruptions are reshaping the future of startups

The big red AI button isn’t working, and the reason is even more troubling: ScienceAlert

Related Posts

Subscribe to Updates