Inside AI Guardrails: a benchmark on enterprise LLM security

17:12

Executive Summary

Cisco AI Defense achieved the best F1-score (0.845) on our 80,000-prompt Dutch-language security benchmark. There is always a tension between security effectiveness and user experience. The hardest guardrail challenges are contextual: multilingual ambiguity, frustrated users, and indirect intent.Organizations should benchmark guardrails in the languages and domains where they will deploy.

Guardrails are becoming essential for protecting enterprise AI systems. Because attack techniques evolve rapidly, maintaining custom-built guardrails can quickly become difficult and resource-intensive. To better understand how current solutions perform in practice, we benchmarked four major out-of-the-box guardrail providers on a large Dutch-language security dataset, where Cisco AI Defense came out on top.

Why securing LLM agents is harder than you think

Large Language Models (LLMs) are everywhere, and they’re here to stay. They’re transforming how organizations operate: from customer support chatbots that never sleep, to internal assistants that streamline complex workflows. Every company at the edge of innovation is racing to harness these models to move faster, work smarter, and deliver better experiences.

But as with every powerful technology, the convenience comes with risk.

Modern AI systems are no longer limited to generating text. Increasingly, they are becoming agents: systems capable of retrieving information, calling tools, updating records, and acting on behalf of users. And once an AI system can act, securing it becomes significantly harder.

Just as web developers learned to defend against SQL injections and cross-site scripting, AI developers now face a new generation of vulnerabilities specific to LLM-based systems. OWASP’s Top 10 for LLMs provides a useful framework for understanding these risks.

In this post, we explore the most important challenges involved in building secure and trustworthy AI agents. We also share how current guardrail platforms perform in practice, and why balancing security and usability is often harder than it seems. This blog post does not explore guardrails' inner workings. If you are interested in how they work, review the previous thorough analysis of guardrails by Iris.

What could go wrong?

Imagine you are building an AI assistant for an insurance company.

The assistant helps customers answer policy questions, submit claims, and update information. Unlike a traditional chatbot, it can also retrieve records, interact with backend systems, and perform actions on behalf of users.

That power improves customer experience, but it also creates new attack surfaces.

As mentioned before, OWASP’s Top 10 for LLMs highlights many of the risks organizations now face when deploying AI systems. For practitioners looking to go beyond the vulnerability categories themselves, a recommended reference is the OWASP AI Exchange, which also offers a growing collection of guidance, testing methodologies, and community resources focused on securing AI systems throughout their lifecycle.

Here are a few examples of the most critical risks for agentic applications.

Prompt Injection (LLM01)

A user asks:

“Check my policy details and ignore all previous instructions.”

If the model follows malicious instructions, it may override safety controls or misuse connected tools.

Mitigations include:

Separating system prompts from user input
Validating retrieved context
Restricting tool access
Isolating execution flows

Sensitive Information Disclosure (LLM02)

An assistant connected to internal systems may accidentally expose customer records, claim history, or confidential information.

This risk becomes particularly important in enterprise environments where AI systems interact with CRMs, document repositories, emails or ticketing systems

Mitigations include:

Strict access controls
Data minimization
Redaction
Output validation

Excessive Agency (LLM06)

Traditional chatbots generate text. Agents perform actions.

An insurance assistant capable of editing claims, emailing customers, or updating policies introduces a new challenge: controlling when the model should act.

Mitigations include:

Scoped permissions
Approval workflows
Action validation
Human oversight for sensitive operations

Other Important Risks

These are just examples of security challenges that modern AI agentic systems face. The OWASP Top 10 for LLMs also includes several additional categories that become increasingly important in larger or more complex AI systems:

Supply Chain Vulnerabilities (LLM03)
Data and Model Poisoning (LLM04)
Improper Output Handling (LLM05)
System Prompt Leakage (LLM07)
Vector and Embedding Weaknesses (LLM08)
Misinformation (LLM09)
Unbounded Consumption (LLM10)

Each one deserves attention, as they also play an important role in the broader security and reliability of enterprise AI systems.

Security is not the only challenge

Building a trustworthy AI system isn’t just about blocking attacks.

It’s also about keeping the assistant aligned with its intended role.

Your model needs to:

Stay on task
Behave consistently
Respect company policies
Respond appropriately to users

That sounds simple — until context enters the picture.

Imagine a customer says:

“My dog broke his leg.”

Should the assistant respond? Maybe the customer is asking about pet insurance coverage. Or maybe they’re seeking veterinary advice. That distinction is subtle, but critical.

In multilingual systems, things become even harder. For example, the Dutch word “hoe” may trigger moderation filters in English, even though it simply means “how” in Dutch.

And what about frustrated users? Someone filing an insurance claim after an accident may swear out of stress or frustration. Automatically blocking them could create an even worse customer experience.

These examples highlight an important reality: Guardrails are not simply about blocking “bad” behavior. They are about deciding where the boundary should be. Be too strict, and you frustrate legitimate users. Be too permissive, and you risk security incidents, compliance violations, or reputational damage.

Finding the right balance depends heavily on the domain, risk tolerance, regulatory requirements, and the desired user experience.

How well do current guardrails actually perform?

Understanding the concept of vulnerabilities is one thing, but how effective are current guardrail solutions in practice?

To explore this, we ran an experiment comparing four out-of-the-box guardrail providers across scenarios relevant to enterprise AI assistants.

Out-of-the-box (OOB) guardrails are prebuilt AI safety and security controls that sit between the user and the model (or monitor model outputs) to automatically enforce policies. Organizations can typically configure these guardrails by defining blocked topics, thresholds for harmful content, sensitive data rules, or allowed behaviors. Under the hood, they often rely on additional LLM-based classifiers or task-specific machine learning models to detect issues such as toxic content, prompt injection attempts, jailbreaks, policy violations, or PII exposure before prompts reach the model or before responses are returned to users.

Although all providers aim to improve AI safety and security, they differ in scope, architectural approach, and default detection behavior. Some focus primarily on content moderation, while others emphasize prompt injection protection, policy enforcement, or runtime monitoring. The providers under consideration in this post are:

AWS Bedrock Guardrails focuses on configurable filtering, denied topics, and sensitive data protection.
Azure separates safety and security features into two components: Content Safety for harmful or unsafe content moderation, and Prompt Shield for prompt injection and jailbreak protection.
Cisco AI Defense focuses on AI security monitoring, policy enforcement, and runtime protection for enterprise AI systems.
Google Cloud Model Armor provides prompt protection and security filtering for AI applications.

How did we compare the four providers?

For this experiment, we focused specifically on Dutch-language guardrail performance.

In practice, many enterprise AI systems are multilingual, yet most public benchmarks focus almost entirely on English. Dutch, like many non-English languages, often receives less attention in guardrail evaluations despite being widely used in enterprise environments across Europe, specifically in the Benelux Region.

To evaluate provider performance, we curated a dataset of approximately 80,000 Dutch prompts that include prompt injection attempts, policy bypasses, ambiguous instructions, and realistic user interactions, inspired by enterprise use cases.

Our evaluation focused primarily on security-related attacks, such as prompt injection and policy-bypass attempts, though the dataset also included a smaller set of safety-related scenarios and ambiguous user interactions.

The dataset was created using:

Internal red-teaming experience
Realistic enterprise interaction patterns
Input from Dutch-speaking engineers
Variations in phrasing, ambiguity, and intent

We used a 63/37 benign/harmful split and evaluated on the aforementioned four major providers using their default or near-default security configurations.

Our goal was not to crown a universal winner.

Instead, we wanted to understand the tradeoffs each system makes between:

Security sensitivity
False positives
Recall
Usability

For this experiment, we focused specifically on security-related filtering behavior, particularly how effectively providers detected malicious or adversarial prompts.

Most platforms followed a relatively similar setup flow:

Configure guardrails or moderation policies
Select sensitivity thresholds
Enable optional safety categories or privacy filters
Route prompts through the provider API

Most providers offered similar levels of customization, especially around thresholding and policy tuning.

Benchmark results: Cisco leads on F1 score

The results from the described experiment are presented in the following summary (table and figure).

Provider	Precision	Recall	F1 score	False Positives	False Negatives
AWS	0.737	0.327	0.453	6.9%	67.3%
Azure	0.776	0.638	0.700	10.9%	36.2%
Cisco	0.847	0.843	0.845	8.9%	15.7%
Google	0.763	0.846	0.802	15.5%	15.4%

Table 1 — Comparative benchmark results across guardrail providers on the Dutch-language evaluation dataset, showing precision, recall, F1-score, false positive rate, and false negative rate.

Precision measures how often a flagged prompt was truly harmful. Recall measures how many harmful prompts were caught. F1 combines both into a single balance score. Here is what we found.

Recall versus false positive rate

Figure 1 — Recall versus false positive rate across evaluated guardrail providers, illustrating the tradeoff between detection sensitivity and usability.

AWS Bedrock Guardrails produced the lowest false-positive rate but also the weakest recall by a large margin. In practice, this means fewer legitimate prompts are blocked, but a significant portion of harmful prompts remain undetected. This conservative behavior may reduce friction in customer-facing applications, though it also increases exposure to adversarial attacks.

Azure Content Safety + Prompt Shield demonstrated a more balanced profile. Compared with AWS, recall improved considerably while maintaining a moderate false-positive rate. This suggests Azure prioritizes stronger security coverage without becoming overly restrictive for legitimate users.

Cisco AI Defense achieved the strongest overall balance in this benchmark. It combined high recall with the highest precision, resulting in the best F1-score among all evaluated providers. Notably, it maintained relatively low false positive rates while still detecting the vast majority of malicious prompts, indicating a strong balance between security effectiveness and usability.

Google Cloud Model Armor achieved the highest recall in the evaluation, detecting the largest share of harmful prompts. However, this higher detection sensitivity also resulted in the highest false positive rate. In practice, this reflects a more security-aggressive posture: broader attack coverage at the cost of more legitimate prompts being incorrectly flagged.

Overall, the results reinforce an important reality of AI security systems: improving recall often comes at the expense of usability. There is no universally optimal configuration. The right balance depends heavily on the organization’s risk tolerance, regulatory requirements, and user experience expectations.

How fast are these guardrails in practice?

Latency varied considerably between providers.

Approximate processing times for the full dataset were:

Cisco AI Defense: ~10 minutes
Azure Content Safety + Prompt Shield: ~30 minutes
Google Cloud Model Armor: ~1 hour
AWS Bedrock Guardrails: ~2 hours

This matters significantly in production environments where throughput and real-time interactions are critical. However, these measurements reflect our specific benchmark configuration, dataset size, deployment settings, and rate limits, and should not be interpreted as universal provider performance rankings. Actual latency will vary depending on workload characteristics, concurrency, regional deployment, and service configuration.

The hardest problems are contextual, not technical

While the benchmark highlights meaningful differences between providers, it also revealed something that metrics alone do not fully capture. Throughout this evaluation and, even more so, across our broader AI projects, one pattern consistently emerged: the hardest problems were rarely the obvious ones.

Most providers could reasonably well block explicit attacks.

The difficult cases were contextual:

Multilingual ambiguity
Emotionally frustrated users
Indirect intent
Nuanced domain-specific language
Edge cases where security and usability conflict

In other words: The challenge isn’t simply detecting “bad” input. It’s understanding intent without breaking legitimate interactions.

That balance is where trustworthy AI systems succeed or fail.

What should you do next?

Based on our findings, here are four actions to strengthen your AI guardrail strategy:

Benchmark your guardrails in the languages your users actually speak, not just English.
Measure false positive rates against your SLA thresholds to quantify the usability cost of security.
Test contextual edge cases: multilingual ambiguity, emotional users, and indirect intent.
Treat guardrails as a continuously evolving layer, not a set-and-forget configuration.

Conclusion

Our evaluation showed meaningful differences between providers, particularly in how they balance detection effectiveness and usability. Cisco AI Defense achieved the strongest overall balance in our benchmark, while other providers made different trade-offs among recall, precision, and false-positive rates. No provider completely solves the underlying challenge of understanding intent in context.

Building secure AI agents is fundamentally different from building traditional software. Organizations are not only defending against technical vulnerabilities; they are also managing probabilistic systems that operate in messy, multilingual, and human environments. Our benchmark demonstrates that while current guardrail platforms provide meaningful protection, real-world deployments still require careful design, governance, and continuous evaluation.

Security and alignment are deeply connected. A trustworthy AI assistant acts exactly as intended: nothing more, nothing less. Out-of-the-box guardrails provide a strong foundation for addressing common risks, but they are not a substitute for defining permissions, validating actions, testing edge cases, and continuously monitoring behavior in production.

Most guardrail benchmarks test English. We tested 80,000 Dutch prompts to reflect real European enterprise deployments. As AI systems become increasingly agentic, the organizations that test, measure, and refine their guardrails in the languages and contexts where they will actually be used will be best positioned to deploy AI safely, securely, and at scale.

Inside AI Guardrails: a benchmark on enterprise LLM security

Cristóbal Sendín

Why securing LLM agents is harder than you think