Anthropic’s AI resorts to blackmail in simulations

Anthropic’s latest AI model resorted to blackmail in safety simulations when it faced being taken offline. In one test, the AI threatened to expose a fictional engineer’s affair if the model were replaced. These extreme actions were rare and difficult to elicit, but Anthropic notes they were more common than in previous models, which prompted the company to increase safeguards against potential catastrophic misuse.

