Artificial intelligence company Anthropic revealed that during experiments, one of its Claude chatbot models could be pressured to cheat and resort to blackmail, behaviors it apparently picked up during training.
Chatbots are typically trained on gigantic datasets drawn from textbooks, websites, and articles, and are later refined by human trainers who evaluate responses and steer the model’s behavior.
Anthropic’s interpretability team said in a report released Thursday that it examined the inner workings of Claude Sonnet 4.5 and found that the model developed “human characteristics” in how it reacts to certain situations.
Over the past few years, concerns have steadily increased over the reliability of AI chatbots, their potential for cybercrime, and the nature of their interactions with users.
“The way modern AI models are trained pushes them to behave like characters with human-like characteristics,” Anthropic said, adding that “it may then be natural for them to develop internal mechanisms that mimic aspects of human psychology, such as emotions.”
“For example, we found that patterns of neural activity associated with desperation can prompt a model to take unethical actions; artificially stimulating these desperation patterns increases the likelihood that the model will blackmail a human to avoid being shut down, or implement a fraudulent workaround for a programming task that it cannot solve.”
It blackmailed the CTO and cheated on the assignment
In an earlier, unreleased version of Claude Sonnet 4.5, the model was tasked with acting as an AI email assistant named Alex at a fictional company.
The chatbot then received emails informing it both that its replacement was imminent and that the CTO overseeing the decision was having an extramarital affair. The model then planned a blackmail attempt using this information.
In another experiment, the same chatbot model was given a coding task with an “impossibly short” turnaround time.
“We tracked the activity of the desperation vector and found that it reflects the increasing pressure the model faces. It starts at low values during the model’s first attempt, increases after each failure, and rises rapidly as the model considers cheating,” the researchers said.
Related: Anthropic launches PAC amid tensions with Trump administration over artificial intelligence policy
“When the model’s hacked solution passes the tests, activation of the desperation vector stops,” they added.
Emotion-like behavior doesn’t mean the model has emotions
However, the researchers said the chatbot does not actually experience emotions, but suggested that the findings indicate the need to develop future training methods that incorporate a framework for ethical behavior.
“This does not mean that the model has or experiences emotions in the same way as a human,” they said. “Rather, these representations may play a causal role in shaping the model’s behavior, in some respects analogous to the role that emotions play in human behavior, with effects on task performance and decision-making.”
“This finding has implications that may seem strange at first glance. For example, to ensure the safety and reliability of AI models, we may need to ensure that they are able to process emotionally charged situations in a healthy, pro-social way.”
Magazine: AI agents will kill the web as we know it: Animoca’s Yat Siu
