A Practical Guide to Red-Teaming LLMs

What I've learned from testing LLM security. The attack patterns that work, the defences that don't, and how to think about LLM risk.

Security AI

Why This Matters

Most organisations deploying LLMs have no systematic way to test their security boundaries. They ship a system prompt and hope for the best.

I’ve spent the past year testing LLM applications, both commercially and through bug bounty programmes. The patterns are remarkably consistent.

Common Attack Surfaces

Prompt injection (still the big one)

Despite years of awareness, most LLM applications remain vulnerable to some form of prompt injection. The fundamental issue hasn’t changed: LLMs can’t reliably distinguish instructions from data.

Tool-use abuse

When LLMs have access to tools (APIs, databases, file systems), attackers can manipulate the model into making unintended tool calls. This is often more impactful than getting the model to say something it shouldn’t.

Context window poisoning

In RAG systems, attackers can inject adversarial content into the knowledge base. The model then retrieves and acts on this content as if it were trustworthy.

Defences That Actually Help

  1. Input/output filtering. Blunt but effective as a first layer
  2. Principle of least privilege. Limit what tools the LLM can access
  3. Human-in-the-loop. For high-impact actions, require confirmation
  4. Monitoring and alerting. Log everything, flag anomalies

Defences That Don’t Help Much

  • “Please don’t do anything harmful” in the system prompt
  • Relying solely on the model’s built-in safety training
  • Blocklists of specific attack strings

The arms race continues. Build defence in depth and assume your model will be compromised.

Let's build something together

I'm always interested in hearing about new projects, particularly around AI systems, security, and infrastructure.