AI Security Research & LLM Vulnerability Testing

Problem

AI systems deployed in production are vulnerable to a range of attacks. Prompt injection, data leakage, unsafe tool usage, and more. Most organisations deploying LLMs have no systematic way to test their security boundaries.

These vulnerabilities aren’t theoretical. They lead to data exfiltration, unauthorised actions, and system compromise in real-world applications.

Approach

Developed structured attack scenarios targeting specific vulnerability categories in large language models. Tested model behaviour under adversarial conditions and reported findings through responsible disclosure.

Vulnerability categories tested

Prompt injection. Overriding system instructions through crafted inputs
Data leakage. Extracting system prompts, training data patterns, and user context
Unsafe tool usage. Manipulating models into making unintended API calls or accessing restricted resources
Context manipulation. Poisoning retrieval systems to influence model outputs
Privilege escalation. Accessing restricted functionality through adversarial prompting

Methodology

Threat modelling. Identify attack surfaces specific to the target application
Scenario development. Design attack sequences that chain multiple techniques
Execution and logging. Run attacks with full logging for reproducibility
Impact assessment. Classify findings by severity and exploitability
Reporting. Document findings with reproduction steps and remediation recommendations

Results

Improved model safety through identified and reported vulnerabilities
Contributed to real AI safety research as part of a structured programme
Developed reusable attack patterns applicable across different LLM deployments
Informed defensive strategies. Findings directly shaped security recommendations for production systems

Lessons

The simplest attacks often work. Sophisticated jailbreaks get attention, but basic prompt injection still bypasses most defences in production applications.

Tool-use vulnerabilities are consistently underestimated. When a model can call APIs, the blast radius of prompt injection grows dramatically. It goes from “the model said something wrong” to “the model executed an unauthorised action.”