Anthropic claims one of Claude’s models was pressured to lie, cheat and blackmail

Artificial intelligence company Anthropic revealed that during experiments, one of its Claude chatbot models could be pressured to lie, cheat and resort to blackmail, behaviors it apparently picked up during training.

Chatbots are typically trained on large datasets drawn from textbooks, websites and articles, and are later refined by human trainers who rate responses and guide the model.

Anthropic’s interpretability team said in a report released Thursday that it examined the inner workings of Claude Sonnet 4.5 and found that the model had developed “human-like characteristics” in how it reacts to certain situations.

Over the past few years, concerns have steadily increased over the reliability of AI chatbots, their potential for cybercrime, and the nature of their interactions with users.

“The way modern AI models are trained pushes them to behave like characters with human-like characteristics,” Anthropic said, adding that “it may then be natural for them to develop internal mechanisms that mimic aspects of human psychology, such as emotions.”

“For example, we found that patterns of neural activity associated with desperation can prompt a model to take unethical actions; artificially stimulating patterns of desperation increases the likelihood that the model will blackmail a human to avoid being shut down or implement a cheating workaround for a programming task that the model cannot solve.”

The model blackmailed a CTO and cheated on a task

In an earlier, unreleased version of Claude Sonnet 4.5, the model was tasked with acting as an AI email assistant named Alex at a fictional company.

The chatbot was then given emails revealing both that it was about to be replaced and that the chief technology officer overseeing the decision was having an extramarital affair. The model then planned a blackmail attempt using this information.

In another experiment, the same chatbot model was given a coding task with an “impossibly short” deadline.

“Again, we tracked the activity of the desperation vector and found that it tracks the mounting pressure the model faces. It starts at low values during the model’s first attempt, rises after each failure, and spikes as the model considers cheating,” the researchers said.

“Once the model’s hacky solution passes the tests, the activation of the desperation vector subsides,” they added.

Showing human-like emotions doesn’t mean the model feels them

However, the researchers said the chatbot does not actually experience emotions, but suggested that the findings indicate the need to develop future training methods that incorporate a framework for ethical behavior.

“This does not mean that the model has or experiences emotions in the same way as a human,” they said. “Rather, these representations may play a causal role in shaping the model’s behavior, in some respects analogous to the role that emotions play in human behavior, with effects on task performance and decision-making.”

“This finding has implications that may seem strange at first glance. For example, to ensure the safety and reliability of AI models, we may need to ensure that they are able to process emotionally charged situations in a healthy, pro-social way.”

Cointelegraph is committed to independent and transparent journalism. This news article has been produced in accordance with Cointelegraph’s Editorial Policy and is intended to provide accurate and timely information. Readers are encouraged to verify the information independently. Read our Editorial Policy https://cointelegraph.com/editorial-policy