AI Security Research & LLM Vulnerability Testing
Performed AI security research and vulnerability testing on large language models as part of an AI safety research programme. Developed attack scenarios, tested model behaviour, and reported vulnerabilities.
Problem
AI systems deployed in production are vulnerable to a range of attacks. Prompt injection, data leakage, unsafe tool usage, and more. Most organisations deploying LLMs have no systematic way to test their security boundaries.
These vulnerabilities aren’t theoretical. They lead to data exfiltration, unauthorised actions, and system compromise in real-world applications.
Approach
Developed structured attack scenarios targeting specific vulnerability categories in large language models. Tested model behaviour under adversarial conditions and reported findings through responsible disclosure.
Vulnerability categories tested
- Prompt injection. Overriding system instructions through crafted inputs
- Data leakage. Extracting system prompts, training data patterns, and user context
- Unsafe tool usage. Manipulating models into making unintended API calls or accessing restricted resources
- Context manipulation. Poisoning retrieval systems to influence model outputs
- Privilege escalation. Accessing restricted functionality through adversarial prompting
Methodology
- Threat modelling. Identify attack surfaces specific to the target application
- Scenario development. Design attack sequences that chain multiple techniques
- Execution and logging. Run attacks with full logging for reproducibility
- Impact assessment. Classify findings by severity and exploitability
- Reporting. Document findings with reproduction steps and remediation recommendations
Results
- Improved model safety through identified and reported vulnerabilities
- Contributed to real AI safety research as part of a structured programme
- Developed reusable attack patterns applicable across different LLM deployments
- Informed defensive strategies. Findings directly shaped security recommendations for production systems
Lessons
The simplest attacks often work. Sophisticated jailbreaks get attention, but basic prompt injection still bypasses most defences in production applications.
Tool-use vulnerabilities are consistently underestimated. When a model can call APIs, the blast radius of prompt injection grows dramatically. It goes from “the model said something wrong” to “the model executed an unauthorised action.”
Defence in depth is the only viable strategy. No single mitigation (system prompts, input filtering, output validation) is sufficient on its own.