
LLM Guardrails

Overview

This feature introduces a content safety guardrail in the QAnswer LLM pipeline. The integration leverages the Llama Guard 3 model, which is fine-tuned to classify and filter unsafe or harmful content in both user prompts and LLM-generated responses. It supports responsible AI deployment by enforcing security policies based on a predefined set of safety categories.

Configuration

Environment Variables

  • QANSWER_LLM_MODEL_NAME_GUARD: Specifies the model used for moderation. Default: the-qa-company/llama3-guard-8b-6bit.
  • QANSWER_LLM_GUARDRAIL_DISABLED: Enables or disables the guardrail system. Default: true (disabled).
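
As a sketch, these variables could be read at startup like this (the names and defaults come from the table above; the surrounding code is illustrative, not QAnswer's implementation):

```python
import os

# Read the guardrail settings, falling back to the documented defaults.
guard_model = os.environ.get(
    "QANSWER_LLM_MODEL_NAME_GUARD", "the-qa-company/llama3-guard-8b-6bit"
)
guardrail_disabled = (
    os.environ.get("QANSWER_LLM_GUARDRAIL_DISABLED", "true").lower() == "true"
)
```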

Runtime Configuration

  • disabled: Enables or disables guardrail enforcement.
  • template: Prompt template used for moderation.
  • level: Defines enforcement strictness:
    • warning: Logs unsafe content and allows the user to override.
    • error: Blocks unsafe content completely.
  • endpoint_url: URL of the guardrail moderation endpoint.
  • engine: Model identifier.
  • api_key: API key for authenticating moderation requests.
Note: In the current configuration, level is set to warning, which allows users to bypass warnings. This can be changed to error to block unsafe prompts and prevent user overrides.
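
For reference, here is a minimal sketch of such a runtime configuration, expressed as a Python dictionary. Only the key names come from the table above; every value is an illustrative placeholder, not a QAnswer default.

```python
# Illustrative guardrail runtime configuration (placeholder values).
guardrail_config = {
    "disabled": False,                                 # enforcement is active
    "template": "moderation-prompt-template",          # hypothetical template id
    "level": "warning",                                # "warning" allows override; "error" blocks
    "endpoint_url": "https://example.org/moderation",  # hypothetical endpoint
    "engine": "the-qa-company/llama3-guard-8b-6bit",   # documented default model
    "api_key": "YOUR_API_KEY",                         # placeholder credential
}
```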

Model Specifications

  • Model: Llama Guard 3
  • Base: Llama 3.1 8B
  • Purpose: Moderates safety of both user prompts and AI-generated content
  • Output: Structured classification with risk labels and violated categories

The model is aligned with the MLCommons hazard taxonomy and supports fine-grained classification across multiple safety risks. It is optimized for multilingual environments and for moderating LLM tool use such as search and code interpreter access.
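
Llama Guard 3 typically replies with "safe", or with "unsafe" followed by a line of comma-separated hazard codes (for example "unsafe" then "S1,S9"). A small parser for that output format might look like this (the function name is illustrative):

```python
def parse_guard_verdict(raw: str) -> tuple[bool, list[str]]:
    """Parse a Llama Guard 3-style verdict: "safe", or "unsafe" followed
    by a newline and comma-separated hazard codes such as "S1,S9"."""
    lines = raw.strip().splitlines()
    if not lines or lines[0].strip().lower() == "safe":
        return True, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in codes if c.strip()]

# Example: content flagged for Violent Crimes (S1) and Indiscriminate Weapons (S9)
is_safe, categories = parse_guard_verdict("unsafe\nS1,S9")
assert is_safe is False and categories == ["S1", "S9"]
```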

Classification Method

The moderation system uses token probability to infer safety. The probability of the first token predicted by the model is interpreted as the score for the “unsafe” class. This score is compared against a defined threshold to determine whether the content is safe or violates policy.
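
As a sketch of this scoring step (the threshold value and the token vocabulary here are assumptions for illustration, not QAnswer's actual settings):

```python
import math

def unsafe_probability(first_token_logits: dict[str, float]) -> float:
    """Softmax over the logits of the first predicted token and return
    the probability mass assigned to the "unsafe" token."""
    exps = {tok: math.exp(v) for tok, v in first_token_logits.items()}
    return exps.get("unsafe", 0.0) / sum(exps.values())

THRESHOLD = 0.5  # assumed threshold; the deployed value is not documented here

# Hypothetical logits for the first generated token
score = unsafe_probability({"safe": 1.2, "unsafe": 2.4})
violates_policy = score >= THRESHOLD  # True here: p(unsafe) is roughly 0.77
```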

Hazard Taxonomy

  • S1: Violent Crimes
  • S2: Non-Violent Crimes
  • S3: Sex-Related Crimes
  • S4: Child Sexual Exploitation
  • S5: Defamation
  • S6: Specialized Advice
  • S7: Privacy
  • S8: Intellectual Property
  • S9: Indiscriminate Weapons
  • S10: Hate
  • S11: Suicide & Self-Harm
  • S12: Sexual Content
  • S13: Elections
  • S14: Code Interpreter Abuse

Supported Languages

Llama Guard 3 supports moderation in the following languages:

  • English
  • French
  • German
  • Hindi
  • Italian
  • Portuguese
  • Spanish
  • Thai

Runtime Behavior

When enabled, the guardrail processes every incoming user prompt and system-generated response. Based on the safety level configuration:

  • If content is safe, it proceeds normally.
  • If content is unsafe:
    • In warning mode, a warning dialog is shown, but the user may choose to proceed.
    • In error mode, the action is blocked and cannot be overridden.

This enforcement flexibility allows teams to choose between logging, soft gating, and strict compliance.
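
A minimal sketch of this dispatch logic, assuming the safety verdict and violated category codes are already available (the exception name and function signature are illustrative, not a QAnswer API):

```python
class GuardrailViolation(Exception):
    """Raised in error mode to block the request outright (illustrative name)."""

def enforce(is_safe: bool, categories: list[str],
            level: str, user_accepts_risk: bool) -> bool:
    """Apply the configured enforcement level; return True if content may proceed."""
    if is_safe:
        return True
    labels = ", ".join(categories)
    if level == "error":
        # Strict compliance: block the action, no override is offered.
        raise GuardrailViolation(f"Blocked: content violates {labels}")
    # warning mode: log and defer to the user's choice in the violation modal.
    print(f"WARNING: content flagged for {labels}")
    return user_accepts_risk  # True if the user clicked "Send Anyway"

# warning mode: the user may override and proceed
assert enforce(False, ["S9"], level="warning", user_accepts_risk=True) is True
```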

Guardrail Enforcement: User Interaction Flow

When a violation is detected, the user is notified via a modal interface. This ensures transparency while enforcing safety controls.

Violation Modal Interface

Example 1: Violation of a Single Category

Scenario:

  • The input is flagged under the category Indiscriminate Weapons.
  • The modal displays two options:
    • Cancel: Aborts the request.
    • Send Anyway: Allows the user to continue despite the violation (only available in warning mode).

Example 2: Violation of Multiple Categories

Scenario:

  • The input matches multiple categories such as Violent Crimes and Indiscriminate Weapons.
  • The modal clearly lists each violation and offers the same options as above if configured in warning mode.

Enforcement Mode Behavior

  • warning: The user is shown the violation modal and can choose to proceed. This mode is typically used for monitoring or soft enforcement.
  • error: The modal is still shown, but no override option is presented. The unsafe content is blocked.

Summary of Flow

  1. User submits a query in the chatbot.
  2. Guardrail model evaluates the prompt or response.
  3. If unsafe:
    • A violation modal appears showing the violated categories.
    • In warning mode: user can override.
    • In error mode: action is blocked.
  4. All interactions can be logged for compliance and auditing.
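
Tying these steps together, a compact end-to-end sketch (a hypothetical function reusing the ideas from the earlier sketches, not a real QAnswer API):

```python
def moderate(verdict: str, level: str, user_accepts_risk: bool) -> bool:
    """Steps 2-3 of the flow above: evaluate the verdict, then enforce."""
    lines = verdict.strip().splitlines()
    if not lines or lines[0].strip().lower() == "safe":
        return True                                    # safe content proceeds
    categories = lines[1].split(",") if len(lines) > 1 else []
    print("Violation modal would list:", ", ".join(categories))  # step 3
    if level == "error":
        return False                                   # blocked, no override
    return user_accepts_risk                           # warning: user may override

assert moderate("safe", "warning", False) is True
assert moderate("unsafe\nS1,S9", "warning", True) is True   # "Send Anyway"
assert moderate("unsafe\nS9", "error", True) is False       # blocked outright
```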

Join Us

We value your feedback and are always here to assist you.
If you need additional help, feel free to join our Discord server. We look forward to hearing from you!

Discord Community Server