A Practical Guide to Red-Teaming LLMs

Why This Matters

Most organisations deploying LLMs have no systematic way to test their security boundaries. They ship a system prompt and hope for the best.

I’ve spent the past year testing LLM applications, both commercially and through bug bounty programmes. The patterns are remarkably consistent.

Common Attack Surfaces

Prompt injection (still the big one)

Despite years of awareness, most LLM applications remain vulnerable to some form of prompt injection. The fundamental issue hasn’t changed: LLMs can’t reliably distinguish instructions from data.

Tool-use abuse

When LLMs have access to tools (APIs, databases, file systems), attackers can manipulate the model into making unintended tool calls. This is often more impactful than getting the model to say something it shouldn’t.

Context window poisoning

In RAG systems, attackers can inject adversarial content into the knowledge base. The model then retrieves and acts on this content as if it were trustworthy.

Defences That Actually Help

Input/output filtering. Blunt but effective as a first layer
Principle of least privilege. Limit what tools the LLM can access
Human-in-the-loop. For high-impact actions, require confirmation
Monitoring and alerting. Log everything, flag anomalies

Defences That Don’t Help Much

“Please don’t do anything harmful” in the system prompt
Relying solely on the model’s built-in safety training
Blocklists of specific attack strings

The arms race continues. Build defence in depth and assume your model will be compromised.