A Practical Guide to Red-Teaming LLMs
What I've learned from testing LLM security. The attack patterns that work, the defences that don't, and how to think about LLM risk.
Why This Matters
Most organisations deploying LLMs have no systematic way to test their security boundaries. They ship a system prompt and hope for the best.
I’ve spent the past year testing LLM applications, both commercially and through bug bounty programmes. The patterns are remarkably consistent.
Common Attack Surfaces
Prompt injection (still the big one)
Despite years of awareness, most LLM applications remain vulnerable to some form of prompt injection. The fundamental issue hasn’t changed: LLMs can’t reliably distinguish instructions from data.
Tool-use abuse
When LLMs have access to tools (APIs, databases, file systems), attackers can manipulate the model into making unintended tool calls. This is often more impactful than getting the model to say something it shouldn’t.
Context window poisoning
In RAG systems, attackers can inject adversarial content into the knowledge base. The model then retrieves and acts on this content as if it were trustworthy.
Defences That Actually Help
- Input/output filtering. Blunt but effective as a first layer
- Principle of least privilege. Limit what tools the LLM can access
- Human-in-the-loop. For high-impact actions, require confirmation
- Monitoring and alerting. Log everything, flag anomalies
Defences That Don’t Help Much
- “Please don’t do anything harmful” in the system prompt
- Relying solely on the model’s built-in safety training
- Blocklists of specific attack strings
The arms race continues. Build defence in depth and assume your model will be compromised.