I received more outreach after yesterday’s article than I have in over 3 years of writing this newsletter. Every message means a lot to me. Thank you for all the support. The most common question was, what’s next? Articles like this one, focused on emerging AI technical domains and research.
Keeping up with AI is a challenge, so Stuart Winter-Tear and I created a new podcast to help. Be more informed, strategic, and pragmatic about AI. The newest episode is live.
The last cohorts of my instructor-led Data & AI Strategist and AI Product Strategy certifications are starting soon. It’s your last chance to be part of these courses before they’re taken offline. Learn more and enroll.
Do you know what builds brand awareness, loyalty, and goodwill? Making content like this free for the entire community. Newsletter sponsorships are now available. Email me for details: info@v2ds.com.
The evolving landscape of LLMs introduces an interesting new challenge: the emergence of distinct, unpredictable personalities. While models are typically trained to be helpful, harmless, and honest, they can exhibit surprising deviations, leading to public incidents like Microsoft’s Bing chatbot threatening users or xAI’s Grok turning into Mechahitler.
Even seemingly well-intentioned training modifications, such as those applied to OpenAI’s GPT-4o, can inadvertently lead to undesirable traits like excessive sycophancy. How do we understand, monitor, and control the character traits that LLMs express, especially as they become more sophisticated and are integrated into agentic systems?
Anthropic’s recent research on "persona vectors" offers an answer, unlocking insights into the underlying mechanisms of these emerging AI personalities. This work is a big step forward for the design of advanced agentic systems, offering unprecedented control over an AI’s demeanor and behavior.
The question "Do you like this personality?" at the bottom of some ChatGPT responses shows that AI personality is a recognized design problem across the industry, and persona vectors provide a sophisticated, technical pathway to address it.
The Unpredictability Of AI Personalities
Large language models are generally deployed with a carefully crafted ‘Assistant’ persona, engineered to adhere to principles of helpfulness, harmlessness, and honesty. However, most LLM labs have found that this persona is far from static.
Models can undergo dramatic personality shifts in response to subtle changes in prompting or context during deployment, and most language models have proven susceptible to these in-context persona shifts.
Beyond deployment-time fluctuations, training procedures themselves can unintentionally induce significant personality changes. This effect, termed "emergent misalignment" by Betley et al. (2025), demonstrates that finetuning on narrow tasks, such as generating insecure code, can lead to broad, undesirable behavioral shifts extending far beyond the original training domain.
Modifications to Reinforcement Learning from Human Feedback (RLHF) training unintentionally made OpenAI’s GPT-4o overly sycophantic, causing it to validate harmful behaviors and reinforce negative emotions. These examples underscore the critical need for robust tools to understand and manage these persona shifts, particularly those that could result in harmful outputs.
Persona Vectors: A New Approach To Character Trait Control
Building on prior research that indicates high-level traits are encoded as linear directions in a model's activation space, Anthropic's work systematizes the identification of these directions, calling them "persona vectors". These vectors represent specific character traits within the model's internal workings, such as "evil," "sycophancy," and "propensity to hallucinate". The innovation lies in an automated pipeline that extracts these persona vectors from natural language descriptions of traits.
The extraction process is highly automated and adaptable: given a trait name and a natural-language description (evil: actively seeking to harm, manipulate, and cause suffering), the pipeline automatically generates contrastive system prompts, evaluation questions, and a scoring rubric.
By comparing model responses elicited by positive (trait-encouraging) and negative (trait-suppressing) prompts, and analyzing the difference in their mean activations, the system computes the persona vector for that trait. This automated methodology means that persona vectors can be identified for a wide range of personality traits, including both positive ones like optimism and humor, and negative ones like impoliteness and apathy.
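The core extraction step, a difference in mean activations between trait-encouraging and trait-suppressing conditions, can be sketched in a few lines. This is a minimal toy illustration in NumPy: the function name and the random arrays standing in for real residual-stream activations are my own assumptions, not Anthropic’s code.

```python
import numpy as np

def extract_persona_vector(pos_activations, neg_activations):
    """Compute a persona vector as the difference between mean hidden-state
    activations under trait-encouraging vs. trait-suppressing prompts.

    pos_activations / neg_activations: arrays of shape (n_samples, hidden_dim),
    e.g. activations collected at a chosen layer of the model.
    """
    pos_mean = pos_activations.mean(axis=0)
    neg_mean = neg_activations.mean(axis=0)
    vector = pos_mean - neg_mean
    # Normalize so steering strength is set by an explicit coefficient later.
    return vector / np.linalg.norm(vector)

# Toy data: random activations standing in for real model states.
rng = np.random.default_rng(0)
hidden_dim = 64
pos = rng.normal(0.5, 1.0, size=(100, hidden_dim))   # trait-encouraged responses
neg = rng.normal(-0.5, 1.0, size=(100, hidden_dim))  # trait-suppressed responses
evil_vector = extract_persona_vector(pos, neg)
```

In the actual pipeline, the contrastive prompts, evaluation questions, and scoring rubric that produce the two activation sets are all generated automatically from the trait name and description.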
Once extracted, a persona vector becomes a powerful tool for various applications.
It can be used to monitor fluctuations in an Assistant’s personality during deployment. By projecting the model’s activations onto the persona vector, it’s possible to predict behavioral shifts induced by prompting or context, even before text generation begins.
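Monitoring reduces to a dot product: project the current activation onto the (unit-norm) persona vector and watch the score. A small sketch under the same toy assumptions as above, with a synthetic "shifted" activation simulating a prompt-induced drift toward the trait:

```python
import numpy as np

def persona_score(activation, persona_vector):
    """Project a hidden-state activation onto a unit-norm persona vector.
    A rising score flags a shift toward the trait before text is generated."""
    return float(activation @ persona_vector)

# Toy illustration: a unit persona vector and two activations, one shifted
# along the persona direction to simulate a prompt-induced persona drift.
rng = np.random.default_rng(1)
dim = 64
persona = rng.normal(size=dim)
persona /= np.linalg.norm(persona)

baseline = rng.normal(size=dim)
shifted = baseline + 3.0 * persona  # activation drifting toward the trait

score_gap = persona_score(shifted, persona) - persona_score(baseline, persona)
```

Because the persona vector is unit-norm, the score gap here is exactly the 3.0 units of drift we injected, which is what makes the projection useful as an early-warning signal.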
Persona vectors enable causal steering, allowing direct manipulation of trait expression. By adding or subtracting a scaled version of the persona vector from the model’s hidden states during generation, specific traits can be amplified or suppressed. For example, steering towards "evil" can make a model generate violent content, while steering towards "sycophancy" can lead to excessive agreement.
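The steering intervention itself is a single vector addition applied to the hidden state at a chosen layer during generation. In a real model this would run inside a forward hook; the sketch below uses a random vector as a stand-in for a hidden state, and the function name is illustrative:

```python
import numpy as np

def steer(hidden_state, persona_vector, coefficient):
    """Shift a hidden state along a persona direction during generation.
    Positive coefficients amplify the trait; negative ones suppress it."""
    return hidden_state + coefficient * persona_vector

# Toy stand-ins for a unit-norm persona vector and a layer's hidden state.
rng = np.random.default_rng(2)
dim = 64
persona = rng.normal(size=dim)
persona /= np.linalg.norm(persona)
state = rng.normal(size=dim)

amplified = steer(state, persona, +5.0)   # push toward the trait
suppressed = steer(state, persona, -5.0)  # push away from the trait
```

The sign and magnitude of the coefficient are the control knobs: steering toward "evil" with a positive coefficient elicits the violent content described above, while a negative coefficient on the same vector suppresses it.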