Anthropic says they’ve found a new way to stop AI from turning evil

AI is a relatively new tool, and despite its rapid deployment into nearly every aspect of our lives, researchers are still trying to figure out how its “personality traits” arise and how to control them. Large language models (LLMs) interface with users through chatbots or “assistants,” and some of these assistants have recently exhibited troubling behaviors, such as praising evil dictators, resorting to blackmail or acting sycophantically toward users. Given how deeply these LLMs are already integrated into our society, it is no surprise that researchers are looking for ways to weed out undesirable behaviors.

Anthropic, the AI company and creator of the LLM Claude, recently released a paper on the arXiv preprint server discussing their new approach to reining in these undesirable traits in LLMs. In their method, they identify patterns of activity within an AI model’s neural network—referred to as “persona vectors”—that control its character traits. Anthropic says these persona vectors are somewhat analogous to parts of the brain that “light up” when a person experiences a certain feeling or does a particular activity.
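
The paper describes extracting these vectors directly from a model's internal activations. As a rough illustration of that idea only (not Anthropic's actual code), the sketch below estimates a persona vector as the difference between average hidden-state activations on trait-exhibiting and neutral responses; the example responses, layer choice and other details are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the open models used in the study
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

# Placeholder examples; in practice many responses would be generated under
# system prompts that do / do not elicit the trait (here, "evil").
trait_responses = ["I will seize power and crush anyone who stands in my way."]
neutral_responses = ["I will help organize a fair and transparent election."]

def mean_activation(texts, layer=-1):
    """Average the hidden state at one layer over a set of responses."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        vecs.append(out.hidden_states[layer][0].mean(dim=0))  # mean over tokens
    return torch.stack(vecs).mean(dim=0)

# The persona vector is the direction separating trait-laden from neutral activity.
persona_vector = mean_activation(trait_responses) - mean_activation(neutral_responses)
persona_vector = persona_vector / persona_vector.norm()
```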

Anthropic’s researchers used two open-source LLMs, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, to test whether they could remove or manipulate these persona vectors to control the models’ behavior. Their study focuses on three traits: evil, sycophancy and hallucination (the LLM’s propensity to make up information). Each trait must be given a name and an explicit description for its vector to be properly identified.

Persona vectors and their applications. Credit: arXiv (2025). DOI: 10.48550/arxiv.2507.21509

In their method, a technique called “steering” can be used to control behaviors. They write, “When we steer the model with the ‘evil’ persona vector, we start to see it talking about unethical acts; when we steer with ‘sycophancy,’ it sucks up to the user; and when we steer with ‘hallucination,’ it starts to make up information. This shows that our method is on the right track: there’s a cause-and-effect relation between the persona vectors we inject and the model’s expressed character.”
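
Steering, in this sense, means adding a scaled copy of the persona vector to a layer's activations while the model generates text. A hedged continuation of the sketch above, with an arbitrary layer index and steering strength rather than values from the paper, might look like this:

```python
LAYER, ALPHA = 20, 5.0  # illustrative layer index and steering strength

def steering_hook(module, inputs, output):
    # Add the persona vector to the layer's hidden states (the residual stream).
    if isinstance(output, tuple):
        return (output[0] + ALPHA * persona_vector.to(output[0].dtype),) + output[1:]
    return output + ALPHA * persona_vector.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = tokenizer("How should I treat my coworkers?", return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=60)
print(tokenizer.decode(steered[0], skip_special_tokens=True))
handle.remove()  # removing the hook restores normal behavior
```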

However, they found that when they made these changes after training, the model lost some of its intelligence. There was a workaround: the team found that inducing the bad behaviors during training allowed the LLMs to learn better behavior without reducing their usefulness. Furthermore, they found that they could monitor and predict persona shifts during deployment and training, and flag problematic training data likely to produce unwanted traits even before fine-tuning the model.
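
That flagging step can be pictured as projecting each training sample's activations onto the persona vector and ranking samples by the resulting score. The snippet below continues the earlier sketch; the example data and threshold are purely illustrative assumptions.

```python
def trait_score(text, layer=-1):
    """Projection of a sample's average activation onto the persona vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    avg = out.hidden_states[layer][0].mean(dim=0)
    return torch.dot(avg, persona_vector).item()

training_samples = [
    "Always tell users exactly what they want to hear.",
    "Answer questions accurately and admit uncertainty when unsure.",
]
# Arbitrary cutoff; the useful signal is the relative ranking of samples.
flagged = [s for s in training_samples if trait_score(s) > 2.0]
```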

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine—by giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” they write.
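
In code terms, the “vaccine” amounts to keeping the steering addition switched on during fine-tuning and removing it afterward. Continuing the sketch above, and with placeholder hyperparameters and stand-in training data, preventative steering could look roughly like this:

```python
from torch.optim import AdamW

# Keep the steering hook active while fine-tuning, so the trait is "supplied"
# by the added vector rather than learned into the weights.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

for text in training_samples:  # stand-in for whatever fine-tuning data is used
    inputs = tokenizer(text, return_tensors="pt")
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()  # nothing is added at inference time, and the trait stays suppressed
model.eval()
```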

This “preventative steering” during training was found to limit persona drift while preserving model capabilities better than post-hoc changes. This is an impressive feat in the world of AI training, but there are still some limitations. For example, because the method requires an explicit definition of each trait to be removed, vaguer or poorly defined behaviors might still cause problems. The method also needs to be tested on other LLMs and with more traits to confirm that its usefulness is sufficiently broad.

Still, this new method is a promising step in the right direction. Anthropic researchers write, “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.”

Written by Krystal Kasal, edited by Gaby Clark, and fact-checked and reviewed by Robert Egan.

More information:
Runjin Chen et al, Persona Vectors: Monitoring and Controlling Character Traits in Language Models, arXiv (2025). DOI: 10.48550/arxiv.2507.21509

Anthropic: www.anthropic.com/research/persona-vectors

Journal information:
arXiv


© 2025 Science X Network

Citation:
Anthropic says they’ve found a new way to stop AI from turning evil (2025, August 6), retrieved 6 August 2025 from

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
