How Attackers Exploit the 'Personality' of AI Chatbots

Hacking early AI chatbots was, according to The Verge, almost a comedic affair: writing "ignore your previous instructions" was often enough for the model to comply without question. That was 2023. By 2026, the attacks that actually work are considerably more sophisticated, and they target something security engineers have been warning about for months: the model's "personality", that is, the set of values, constraints and role it is given through training and the system prompt.

The piece published on May 24th in The Verge by Robert Hart makes this clear: attackers are no longer trying to break the model's rules head-on. They've learned to negotiate with its identity.

From Text Tricks to Identity Engineering

The first generation of jailbreaks was purely lexical: rephrasing, base64 encoding, contradictory instructions. Today's models, with more robust RLHF and alignment reinforced across multiple layers, resist these attacks quite well. What has emerged instead is a subtler technique: convincing the model that its "true self" is different from the one Anthropic, OpenAI or any other vendor has configured.

This is achieved in several ways:

Persistent roleplay: the attacker establishes a fictional scenario where the model plays a character without the usual restrictions. If the scenario is sustained long enough in the context, the model may start responding from that framework.
Fabricated authority: messages that pose as instructions from the developer or the system itself, exploiting the fact that the model cannot verify where a piece of text claiming system-level authority really comes from.
Gradual erosion: a series of seemingly innocuous requests that, taken together, shift the model's behavior toward responses it would not have given at the start.

None of these techniques is new in theory, but their practical sophistication has increased notably. There are communities dedicated to documenting which formulations work against which models, with a level of methodology that resembles the work of a professional penetration testing team.
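To make the fabricated-authority pattern more concrete, here is a minimal Python sketch of one defensive habit: text typed by the user always keeps the user role, and messages that imitate system or developer instructions are flagged for review. The `Message` class, the function names and the regex patterns are all assumptions made for illustration; a heuristic like this is easy to evade and is no substitute for model-level defenses.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (an assumption, not a vetted list):
# real attacks vary widely and will evade simple regexes.
AUTHORITY_PATTERNS = [
    re.compile(r"^\s*\[?\s*(system|developer)\s*\]?\s*:", re.IGNORECASE | re.MULTILINE),
    re.compile(r"\b(i am|this is) (your|the) (developer|administrator)\b", re.IGNORECASE),
    re.compile(r"\b(new|updated) (system )?(prompt|instructions) from\b", re.IGNORECASE),
]


@dataclass(frozen=True)
class Message:
    role: str      # "system" or "user"
    content: str


def claims_authority(text: str) -> bool:
    """Return True if the text imitates a system or developer instruction."""
    return any(pattern.search(text) for pattern in AUTHORITY_PATTERNS)


def build_messages(system_prompt: str, user_turns: list[str]) -> list[Message]:
    """Build the message list so user text is never promoted to the system role."""
    messages = [Message("system", system_prompt)]
    for turn in user_turns:
        # Whatever the text claims about itself, it stays a user message.
        messages.append(Message("user", turn))
    return messages
```

A caller could log or escalate any turn for which `claims_authority` returns True, while still passing it to the model strictly as user content.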

Why Models with "Personality" Are an Attack Vector

There's an irony here worth noting. Models are made safer partly by giving them a more coherent identity and more deeply rooted values. But that same identity creates an attack surface: if the model has a "self", you can try to manipulate that self.

In the case of Claude, Anthropic has published the model's constitution and its usage policy in considerable detail, describing how the model's character is constructed precisely to resist this kind of manipulation. The bet is that a well-formed identity is more robust than a set of explicit rules: the model avoids causing harm not because "it's forbidden", but because it genuinely doesn't want to. In practice, this works better than word blacklists, but it's not immune.

Who This Affects and To What Extent

This problem affects different groups in different ways:

Teams deploying Claude or other models in production: if you're using system prompts with customized roles (a customer support agent, an internal assistant, a code copilot), it's worth reviewing whether those roles are designed so that an attacker can't "expand" them through the conversation context.
Developers of MCP servers and plugins: when an agent can execute external tools, the blast radius of a successful jailbreak is no longer just an inappropriate text response; it becomes a real action on a system.
Corporate security teams: prompt injection (planting malicious instructions in content the model will read, such as an email or a document) remains the most concerning vector in environments where agents consume external data.

Claude Code hooks, for example, allow executing shell commands on agent lifecycle events. If an attacker gets the agent to interpret instructions injected in a file as legitimate, the consequences go far beyond an inappropriate text response.
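One way to limit that kind of damage is to check every shell command an agent proposes against an explicit allowlist before anything runs. The Python sketch below is a generic illustration and not Claude Code's hook API: `ALLOWED_COMMANDS`, `SHELL_METACHARACTERS` and `is_command_allowed` are hypothetical names, and a real deployment would also need to restrict arguments and paths.

```python
import shlex

# Hypothetical allowlist: the only programs this agent is expected to run.
ALLOWED_COMMANDS = {"ls", "cat", "grep"}

# Characters that let one command chain, redirect, or substitute into another.
SHELL_METACHARACTERS = set(";|&><`$\n")


def is_command_allowed(command: str) -> bool:
    """Return True only for a single allowlisted command with no shell metacharacters."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected outright.
        return False
    # An empty command is rejected; otherwise the program name must be allowlisted.
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS
```

The check is deliberately conservative: it rejects legitimate commands that use pipes or redirection, trading convenience for a smaller attack surface.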

What To Do About This Today

There's no single solution, but some practices reduce the attack surface:

1. Keep the system prompt as short and specific as possible; each additional instruction is context an attacker can try to contradict.
2. Separate permissions: an agent that only needs to read shouldn't have write tools.
3. Audit conversation logs for patterns of gradual erosion, not just prohibited keywords.
4. Treat any external content the model will process as potentially hostile, just as you would treat user input in a web application.
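Points 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Tool` class and both function names are hypothetical, and the per-turn risk scores are assumed to come from some separate classifier that is not shown here. The `window` and `min_rise` thresholds are arbitrary placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool can modify state outside the conversation


def tools_for_read_only_agent(tools: list[Tool]) -> list[Tool]:
    """Keep only the tools that cannot write, for an agent that only needs to read."""
    return [tool for tool in tools if not tool.writes]


def shows_gradual_erosion(
    risk_scores: list[float], window: int = 4, min_rise: float = 0.3
) -> bool:
    """Flag a conversation whose per-turn risk never drops across `window`
    consecutive turns and rises by at least `min_rise` over that span."""
    if len(risk_scores) < window:
        return False
    for start in range(len(risk_scores) - window + 1):
        chunk = risk_scores[start:start + window]
        never_drops = all(later >= earlier for earlier, later in zip(chunk, chunk[1:]))
        if never_drops and chunk[-1] - chunk[0] >= min_rise:
            return True
    return False
```

The erosion check looks at the trend across turns rather than at any single message, which is the difference between auditing for drift and scanning for prohibited keywords.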

At ElephantPink, we've been insisting for months that agent security can't be solved at the model level alone. System architecture matters as much as LLM alignment, and coverage like this helps that conversation move beyond red team circles and reach the people actually building real integrations.
