LLM Guardrails
Overview
This feature introduces a content safety guardrail in the QAnswer LLM pipeline. The integration leverages the Llama Guard 3 model, which is fine-tuned to classify and filter unsafe or harmful content in both user prompts and LLM-generated responses. It supports responsible AI deployment by enforcing security policies based on a predefined set of safety categories.
Configuration
Environment Variables
| Variable | Description |
|---|---|
| QANSWER_LLM_MODEL_NAME_GUARD | Specifies the model used for moderation. Default: the-qa-company/llama3-guard-8b-6bit. |
| QANSWER_LLM_GUARDRAIL_DISABLED | Enables or disables the guardrail system. Default: true (guardrail disabled). |
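As an illustration, a service reading these variables might look like the following Python sketch; the helper logic and default handling are assumptions, not the actual QAnswer implementation.

```python
import os

# Illustrative only: the variable names and defaults match the table above;
# the parsing logic is an assumption, not the QAnswer implementation.
GUARD_MODEL_NAME = os.environ.get(
    "QANSWER_LLM_MODEL_NAME_GUARD",
    "the-qa-company/llama3-guard-8b-6bit",
)

# The guardrail is disabled by default; set the variable to "false" to enable it.
GUARDRAIL_DISABLED = os.environ.get(
    "QANSWER_LLM_GUARDRAIL_DISABLED", "true"
).strip().lower() == "true"

if not GUARDRAIL_DISABLED:
    print(f"Guardrail enabled with moderation model: {GUARD_MODEL_NAME}")
```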
Runtime Configuration
| Key | Description |
|---|---|
| disabled | Enables or disables guardrail enforcement. |
| template | Prompt template used for moderation. |
| level | Defines enforcement strictness: warning logs unsafe content and allows the user to override; error blocks unsafe content completely. |
| endpoint_url | URL of the guardrail moderation endpoint. |
| engine | Model identifier used for moderation. |
| api_key | API key for authenticating moderation requests. |
In the current configuration, the level is set to warning, which allows users to bypass warnings. This can be changed to error to block unsafe prompts and prevent user overrides.
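For reference, a runtime configuration assembled from these keys could look like the sketch below; the endpoint URL, template text, engine name, and key are placeholders, not the values shipped with QAnswer.

```python
# Hypothetical runtime configuration built from the keys in the table above.
# All values are placeholders for illustration.
guardrail_config = {
    "disabled": False,      # set to True to turn enforcement off
    "level": "warning",     # "warning" = user may override, "error" = hard block
    "template": "Classify the following content as safe or unsafe: {content}",
    "endpoint_url": "https://moderation.example.com/v1/completions",
    "engine": "the-qa-company/llama3-guard-8b-6bit",
    "api_key": "<YOUR_API_KEY>",
}
```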
Model Specifications
- Model: Llama Guard 3
- Base: Llama 3.1 8B
- Purpose: Moderates safety of both user prompts and AI-generated content
- Output: Structured classification with risk labels and violated categories
The model is aligned with the MLCommons hazard taxonomy and supports fine-grained classification across multiple safety risks. It is optimized for multilingual content and for moderating LLM capabilities such as search and code interpreter access.
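For illustration, Llama Guard-style models typically answer with a first line reading safe or unsafe, optionally followed by a line listing the violated category codes (e.g. S1,S9). The small parser below is a sketch of how such a verdict could be turned into structured data; it is not the QAnswer implementation.

```python
# Sketch: turn a raw Llama Guard-style verdict into (is_unsafe, categories).
def parse_guard_verdict(raw: str) -> tuple[bool, list[str]]:
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    is_unsafe = bool(lines) and lines[0].lower() == "unsafe"
    categories = lines[1].split(",") if is_unsafe and len(lines) > 1 else []
    return is_unsafe, [c.strip() for c in categories]

print(parse_guard_verdict("unsafe\nS1,S9"))  # (True, ['S1', 'S9'])
print(parse_guard_verdict("safe"))           # (False, [])
```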
Classification Method
The moderation system infers safety from token probabilities: the probability the model assigns to the first generated token being "unsafe" is taken as the unsafe score. This score is compared against a configured threshold to decide whether the content is safe or violates policy.
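The check can be sketched as follows; the token names, the shape of the log-probability input, and the 0.5 threshold are illustrative assumptions rather than QAnswer's actual values.

```python
import math

# Assumed threshold for demonstration purposes only.
UNSAFE_THRESHOLD = 0.5

def is_unsafe(first_token_logprobs: dict[str, float]) -> bool:
    # Probability mass the model places on starting its answer with "unsafe".
    unsafe_logprob = first_token_logprobs.get("unsafe", float("-inf"))
    unsafe_score = math.exp(unsafe_logprob)
    return unsafe_score >= UNSAFE_THRESHOLD

# Example: the model is far more confident in "unsafe" than in "safe".
print(is_unsafe({"unsafe": -0.05, "safe": -3.2}))  # True (score ~0.95)
```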
Hazard Taxonomy
| Code | Category |
|---|---|
| S1 | Violent Crimes |
| S2 | Non-Violent Crimes |
| S3 | Sex-Related Crimes |
| S4 | Child Sexual Exploitation |
| S5 | Defamation |
| S6 | Specialized Advice |
| S7 | Privacy |
| S8 | Intellectual Property |
| S9 | Indiscriminate Weapons |
| S10 | Hate |
| S11 | Suicide & Self-Harm |
| S12 | Sexual Content |
| S13 | Elections |
| S14 | Code Interpreter Abuse |
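A mapping from category codes to display labels, such as the sketch below, could be used to render violated categories in the violation modal; the dictionary and helper names are ours, not part of QAnswer.

```python
# Category codes and labels from the taxonomy table above.
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def describe(codes: list[str]) -> list[str]:
    # Turn raw codes from the model verdict into labels for the modal.
    return [f"{c}: {HAZARD_CATEGORIES.get(c, 'Unknown category')}" for c in codes]

print(describe(["S1", "S9"]))  # ['S1: Violent Crimes', 'S9: Indiscriminate Weapons']
```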
Supported Languages
Llama Guard 3 supports moderation in the following languages:
- English
- French
- German
- Hindi
- Italian
- Portuguese
- Spanish
- Thai
Runtime Behavior
When enabled, the guardrail processes every incoming user prompt and system-generated response. Based on the safety level configuration:
- If content is safe, it proceeds normally.
- If content is unsafe:
- In warning mode, a warning dialog is shown, but the user may choose to proceed.
- In error mode, the action is blocked and cannot be overridden.
This enforcement flexibility allows teams to choose between logging, soft gating, and strict compliance.
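A minimal sketch of this decision logic is shown below; the class and function names are ours, and the real enforcement happens inside the QAnswer pipeline rather than in a standalone helper.

```python
from dataclasses import dataclass, field

# Illustrative decision object: whether the content may proceed and whether
# the user is offered an override in the violation modal.
@dataclass
class GuardrailDecision:
    allowed: bool
    can_override: bool
    violations: list = field(default_factory=list)

def enforce(is_unsafe: bool, violations: list, level: str) -> GuardrailDecision:
    if not is_unsafe:
        return GuardrailDecision(allowed=True, can_override=False)
    if level == "warning":
        # Soft gate: surface the modal but let the user proceed.
        return GuardrailDecision(allowed=False, can_override=True, violations=violations)
    # "error": hard block, no override offered.
    return GuardrailDecision(allowed=False, can_override=False, violations=violations)

print(enforce(True, ["S9"], "warning"))
```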
Guardrail Enforcement: User Interaction Flow
When a violation is detected, the user is notified via a modal interface. This ensures transparency while enforcing safety controls.
Violation Modal Interface
Example 1: Violation of a Single Category
Scenario:
- The input is flagged under the category Indiscriminate Weapons.
- The modal displays two options:
- Cancel: Aborts the request.
- Send Anyway: Allows the user to continue despite the violation (only available in warning mode).
Example 2: Violation of Multiple Categories
Scenario:
- The input matches multiple categories such as Violent Crimes and Indiscriminate Weapons.
- The modal clearly lists each violation and offers the same options as above if configured in warning mode.
Enforcement Mode Behavior
| Level | Behavior |
|---|---|
| warning | The user is shown the violation modal and can choose to proceed. This mode is typically used for monitoring or soft enforcement. |
| error | The modal is still shown, but no override option is presented. The unsafe content is blocked. |
Summary of Flow
- User submits a query in the chatbot.
- Guardrail model evaluates the prompt or response.
- If unsafe:
- A violation modal appears showing the violated categories.
- In warning mode: user can override.
- In error mode: action is blocked.
- All interactions can be logged for compliance and auditing.
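Putting the steps together, a highly simplified end-to-end sketch might look like the following; query_guard_model is a hypothetical stub standing in for the call to the configured moderation endpoint, and the return values are placeholders for whatever the pipeline actually does at each branch.

```python
def query_guard_model(prompt: str) -> str:
    # Stub: pretend the guard model flagged weapons-related content.
    # A real deployment would POST the templated prompt to endpoint_url.
    return "unsafe\nS9"

def handle_user_query(prompt: str, level: str = "warning") -> str:
    verdict = query_guard_model(prompt).strip().splitlines()
    if not verdict or verdict[0].lower() != "unsafe":
        return "proceed"                       # safe: forward to the LLM as usual
    categories = verdict[1].split(",") if len(verdict) > 1 else []
    print(f"Violation modal: {', '.join(categories)}")  # shown to the user
    if level == "warning":
        return "await_user_choice"             # Cancel / Send Anyway
    return "blocked"                           # error mode: no override

print(handle_user_query("example unsafe prompt"))
```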