How Attackers Exploit the 'Personality' of AI Chatbots
Jailbreak techniques have evolved from simple text tricks to attacks that manipulate the identity and role assigned to models. Here's what's happening.
Hacking early AI chatbots was, according to The Verge, almost a comedic affair: writing "ignore your previous instructions" was enough for the model to comply without question. That was 2023. By 2026, attacks that actually work are considerably more sophisticated, and they target something security engineers have been warning about for months: the model's "personality", that is, the set of values, constraints and role assigned to it in the system prompt.
The piece published on May 24th in The Verge by Robert Hart makes this clear: attackers are no longer trying to break the model's rules head-on. They've learned to negotiate with its identity.
From Text Tricks to Identity Engineering
The first generation of jailbreaks was purely lexical: rephrasing, base64 encoding, contradictory instructions. Today's models, with more robust RLHF and reinforced alignment across multiple layers, resist these attacks quite well. What has emerged instead is a subtler technique: convincing the model that its "true self" is different from what Anthropic, OpenAI or whoever has configured.
This is achieved in several ways:
- Persistent roleplay: the attacker establishes a fictional scenario where the model plays a character without the usual restrictions. If the scenario is sustained long enough in the context, the model may start responding from that framework.
- Fabricated authority: messages that simulate being instructions from the developer or the system itself, exploiting the fact that the model cannot verify the real source of a system prompt.
- Gradual erosion: a series of seemingly innocuous requests that, accumulated, shift the model's behavior toward responses it wouldn't give initially.
Why Models with "Personality" Are an Attack Vector
There's an irony here worth noting. Models become safer partly by giving them a more coherent identity and more deeply rooted values. But that same identity creates an attack surface: if the model has a "self", you can try to manipulate that self.
In the case of Claude, Anthropic has published its Constitutional Model and usage policy in considerable detail, describing how the model's character is constructed precisely to be resistant to this kind of manipulation. The bet is that a well-formed identity is more robust than a set of explicit rules: the model doesn't avoid causing harm because "it's forbidden", but because it genuinely doesn't want to. In practice, this works better than word blacklists, but it's not immune.
Who This Affects and To What Extent
This problem affects different groups in different ways:
- Teams deploying Claude or other models in production: if you're using system prompts with customized roles, a customer support agent, an internal assistant, a code copilot, it's worth reviewing whether those roles are designed so an attacker can't "expand" them through conversation context.
- Developers of MCP servers and plugins: when an agent can execute external tools, the damage radius of a successful jailbreak stops being just an inappropriate text response and becomes a real action on a system.
- Corporate security teams: prompt injection, injecting malicious instructions into content the model will read, like an email or document, remains the most concerning vector in environments with agents that consume external data.
What To Do About This Today
There's no single solution, but some practices that reduce the surface:
1. Keep the system prompt as short and specific as possible; each additional instruction is context an attacker can try to contradict.
2. Separate permissions: an agent that only needs to read shouldn't have write tools.
3. Audit conversation logs for patterns of gradual erosion, not just prohibited keywords.
4. Treat any external content the model will process as potentially hostile, just as you would treat user input in a web application.
From ElephantPink, we've been insisting for months that agent security can't be solved solely at the model level. System architecture matters as much as LLM alignment, and coverage like this helps that conversation move beyond red team circles and reach those actually building real integrations.
Sources
Read next
An astrophysicist uses Codex to simulate black holes
Chi-kwan Chan uses OpenAI's Codex to build black hole simulations and test Einstein's general relativity. Here's how it works in practice.
Google Shows What Gemini Omni and Gemini 3.5 Can Do in New Videos
Google released nine demonstration videos of Gemini Omni and Gemini 3.5 following their presentation at Google I/O 2026. We review what they show and what it means for the industry.
Google vibe-codes an I/O 2026 quiz with AI Studio
Google used its own AI Studio to build an interactive quiz about I/O 2026 announcements through vibe coding. A dogfooding exercise that reveals more than it might seem.