LLM Guardrails
Overview
This feature introduces a content safety guardrail in the QAnswer LLM pipeline. The integration leverages the Llama Guard 3 model, which is fine-tuned to classify and filter unsafe or harmful content in both user prompts and LLM-generated responses. It supports responsible AI deployment by enforcing security policies based on a predefined set of safety categories.
Configuration
Environment Variables
| Variable | Description |
|---|---|
| QANSWER_LLM_MODEL_NAME_GUARD | Specifies the model used for moderation. Default: the-qa-company/llama3-guard-8b-6bit. |
| QANSWER_LLM_GUARDRAIL_DISABLED | Enables or disables the guardrail system. Default: true (guardrail disabled). |
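As an illustration, a service reading these variables might look like the following Python sketch; the helper logic and default handling are assumptions, not the actual QAnswer implementation.

```python
import os

# Illustrative only: the variable names and defaults match the table above;
# the parsing logic is an assumption, not the QAnswer implementation.
GUARD_MODEL_NAME = os.environ.get(
    "QANSWER_LLM_MODEL_NAME_GUARD",
    "the-qa-company/llama3-guard-8b-6bit",
)

# The guardrail is disabled by default; set the variable to "false" to enable it.
GUARDRAIL_DISABLED = os.environ.get(
    "QANSWER_LLM_GUARDRAIL_DISABLED", "true"
).strip().lower() == "true"

if not GUARDRAIL_DISABLED:
    print(f"Guardrail enabled with moderation model: {GUARD_MODEL_NAME}")
```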
Runtime Configuration
| Key | Description |
|---|---|
| disabled | Enables or disables guardrail enforcement. |
| template | Prompt template used for moderation. |
| level | Defines enforcement strictness: warning logs unsafe content and allows the user to override; error blocks unsafe content completely. |
| endpoint_url | URL of the guardrail moderation endpoint. |
| engine | Model identifier used for moderation. |
| api_key | API key for authenticating moderation requests. |
In the current configuration, the level is set to warning, which allows users to bypass warnings. This can be changed to error to block unsafe prompts and prevent user overrides.
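For reference, a runtime configuration assembled from these keys could look like the sketch below; the endpoint URL, template text, engine name, and key are placeholders, not the values shipped with QAnswer.

```python
# Hypothetical runtime configuration built from the keys in the table above.
# All values are placeholders for illustration.
guardrail_config = {
    "disabled": False,      # set to True to turn enforcement off
    "level": "warning",     # "warning" = user may override, "error" = hard block
    "template": "Classify the following content as safe or unsafe: {content}",
    "endpoint_url": "https://moderation.example.com/v1/completions",
    "engine": "the-qa-company/llama3-guard-8b-6bit",
    "api_key": "<YOUR_API_KEY>",
}
```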
Model Specifications
- Model: Llama Guard 3
- Base: Llama 3.1 8B
- Purpose: Moderates safety of both user prompts and AI-generated content
- Output: Structured classification with risk labels and violated categories
The model is aligned with the MLCommons hazard taxonomy and supports fine-grained classification across multiple safety risks. It is optimized for multilingual content and for moderating LLM capabilities such as search and code interpreter access.
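For illustration, Llama Guard-style models typically answer with a first line reading safe or unsafe, optionally followed by a line listing the violated category codes (e.g. S1,S9). The small parser below is a sketch of how such a verdict could be turned into structured data; it is not the QAnswer implementation.

```python
# Sketch: turn a raw Llama Guard-style verdict into (is_unsafe, categories).
def parse_guard_verdict(raw: str) -> tuple[bool, list[str]]:
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    is_unsafe = bool(lines) and lines[0].lower() == "unsafe"
    categories = lines[1].split(",") if is_unsafe and len(lines) > 1 else []
    return is_unsafe, [c.strip() for c in categories]

print(parse_guard_verdict("unsafe\nS1,S9"))  # (True, ['S1', 'S9'])
print(parse_guard_verdict("safe"))           # (False, [])
```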
Classification Method
The moderation system infers safety from token probabilities: the probability the model assigns to the first generated token being "unsafe" is taken as the unsafe score. This score is compared against a configured threshold to decide whether the content is safe or violates policy.
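The check can be sketched as follows; the token names, the shape of the log-probability input, and the 0.5 threshold are illustrative assumptions rather than QAnswer's actual values.

```python
import math

# Assumed threshold for demonstration purposes only.
UNSAFE_THRESHOLD = 0.5

def is_unsafe(first_token_logprobs: dict[str, float]) -> bool:
    # Probability mass the model places on starting its answer with "unsafe".
    unsafe_logprob = first_token_logprobs.get("unsafe", float("-inf"))
    unsafe_score = math.exp(unsafe_logprob)
    return unsafe_score >= UNSAFE_THRESHOLD

# Example: the model is far more confident in "unsafe" than in "safe".
print(is_unsafe({"unsafe": -0.05, "safe": -3.2}))  # True (score ~0.95)
```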
Hazard Taxonomy
| Code | Category |
|---|---|
| S1 | Violent Crimes |
| S2 | Non-Violent Crimes |
| S3 | Sex-Related Crimes |
| S4 | Child Sexual Exploitation |
| S5 | Defamation |
| S6 | Specialized Advice |
| S7 | Privacy |
| S8 | Intellectual Property |
| S9 | Indiscriminate Weapons |
| S10 | Hate |
| S11 | Suicide & Self-Harm |
| S12 | Sexual Content |
| S13 | Elections |
| S14 | Code Interpreter Abuse |
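A mapping from category codes to display labels, such as the sketch below, could be used to render violated categories in the violation modal; the dictionary and helper names are ours, not part of QAnswer.

```python
# Category codes and labels from the taxonomy table above.
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def describe(codes: list[str]) -> list[str]:
    # Turn raw codes from the model verdict into labels for the modal.
    return [f"{c}: {HAZARD_CATEGORIES.get(c, 'Unknown category')}" for c in codes]

print(describe(["S1", "S9"]))  # ['S1: Violent Crimes', 'S9: Indiscriminate Weapons']
```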
Supported Languages
Llama Guard 3 supports moderation in the following languages:
- English
- French
- German
- Hindi
- Italian
- Portuguese
- Spanish
- Thai
Runtime Behavior
When enabled, the guardrail processes every incoming user prompt and system-generated response. Based on the safety level configuration:
- If content is safe, it proceeds normally.
- If content is unsafe:
- In warning mode, a warning dialog is shown, but the user may choose to proceed.
- In error mode, the action is blocked and cannot be overridden.
This enforcement flexibility allows teams to choose between logging, soft gating, and strict compliance.
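A minimal sketch of this decision logic is shown below; the class and function names are ours, and the real enforcement happens inside the QAnswer pipeline rather than in a standalone helper.

```python
from dataclasses import dataclass, field

# Illustrative decision object: whether the content may proceed and whether
# the user is offered an override in the violation modal.
@dataclass
class GuardrailDecision:
    allowed: bool
    can_override: bool
    violations: list = field(default_factory=list)

def enforce(is_unsafe: bool, violations: list, level: str) -> GuardrailDecision:
    if not is_unsafe:
        return GuardrailDecision(allowed=True, can_override=False)
    if level == "warning":
        # Soft gate: surface the modal but let the user proceed.
        return GuardrailDecision(allowed=False, can_override=True, violations=violations)
    # "error": hard block, no override offered.
    return GuardrailDecision(allowed=False, can_override=False, violations=violations)

print(enforce(True, ["S9"], "warning"))
```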
Guardrail Enforcement: User Interaction Flow
When a violation is detected, the user is notified via a modal interface. This ensures transparency while enforcing safety controls.
Violation Modal Interface
Example 1: Violation of a Single Category
Scenario:
- The input is flagged under the category Indiscriminate Weapons.
- The modal displays two options:
- Cancel: Aborts the request.
- Send Anyway: Allows the user to continue despite the violation (only available in warning mode).
Example 2: Violation of Multiple Categories
Scenario:
- The input matches multiple categories such as Violent Crimes and Indiscriminate Weapons.
- The modal clearly lists each violation and offers the same options as above if configured in warning mode.
Enforcement Mode Behavior
| Level | Behavior |
|---|---|
| warning | The user is shown the violation modal and can choose to proceed. This mode is typically used for monitoring or soft enforcement. |
| error | The modal is still shown, but no override option is presented. The unsafe content is blocked. |
Summary of Flow
- User submits a query in the chatbot.
- Guardrail model evaluates the prompt or response.
- If unsafe:
- A violation modal appears showing the violated categories.
- In warning mode: user can override.
- In error mode: action is blocked.
- All interactions can be logged for compliance and auditing.
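Putting the steps together, a highly simplified end-to-end sketch might look like the following; query_guard_model is a hypothetical stub standing in for the call to the configured moderation endpoint, and the return values are placeholders for whatever the pipeline actually does at each branch.

```python
def query_guard_model(prompt: str) -> str:
    # Stub: pretend the guard model flagged weapons-related content.
    # A real deployment would POST the templated prompt to endpoint_url.
    return "unsafe\nS9"

def handle_user_query(prompt: str, level: str = "warning") -> str:
    verdict = query_guard_model(prompt).strip().splitlines()
    if not verdict or verdict[0].lower() != "unsafe":
        return "proceed"                       # safe: forward to the LLM as usual
    categories = verdict[1].split(",") if len(verdict) > 1 else []
    print(f"Violation modal: {', '.join(categories)}")  # shown to the user
    if level == "warning":
        return "await_user_choice"             # Cancel / Send Anyway
    return "blocked"                           # error mode: no override

print(handle_user_query("example unsafe prompt"))
```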