LLM-Moderation and Safety Layer - Phase 2
Overview
With Phase 2 of the LLM-Moderation & Safety Layer, we introduce enhanced controls, expanded moderation coverage, and organization-level governance capabilities for safer and more flexible LLM interactions within the application.
Phase 1 focused on moderating user input going into the LLM; refer to this document for Phase 1 details. Phase 2 extends this functionality significantly by adding:
Organization-level toggles to enable or disable the LLM moderation feature.
Moderation for LLM output (messages generated by the bot).
Enhanced safety checkpoints for both user-to-LLM and LLM-to-user interactions.
How these features work is explained below.
Organization-Level Moderation Control
Previously (Phase 1), LLM moderation was always enabled with no ability for the application or organization to turn it off.
Navigate to the organization-level configuration panel, where an admin can manage the moderation settings related to LLM (Large Language Model) safety controls.
Admins can enable or disable LLM moderation at the organization level.
This toggle controls whether safety checks are applied to:
User input sent to the LLM
LLM-generated output sent back to the user
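The toggle described above acts as a gate evaluated before either check runs. The sketch below illustrates this gating logic under assumed names; `OrgModerationSettings` and its fields are illustrative, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class OrgModerationSettings:
    # Hypothetical org-level settings; field names are illustrative.
    moderation_enabled: bool = True      # master switch for the whole feature
    user_input_moderation: bool = True   # checks messages sent to the LLM
    llm_output_moderation: bool = True   # checks messages from the LLM

def should_moderate_input(settings: OrgModerationSettings) -> bool:
    # Input checks run only when the feature and the input toggle are both on.
    return settings.moderation_enabled and settings.user_input_moderation

def should_moderate_output(settings: OrgModerationSettings) -> bool:
    # Output checks are gated the same way by the output toggle.
    return settings.moderation_enabled and settings.llm_output_moderation
```

Disabling the organization-level switch turns off both checks at once, regardless of the individual toggles.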

User Input Moderation
This toggle manages moderation for user messages sent to the bot. When turned on:
Every user message is first checked by the moderation service.
Harmful, offensive, or unsafe content is flagged or blocked before it reaches the AI.
This helps prevent risky prompts from being sent to the AI.
LLM Output Moderation
This toggle enables moderation of AI-generated messages before they are shown to users. If harmful or offensive content is detected, it is blocked or replaced with a safe response. This prevents the bot from sending inappropriate messages.
Moderation of LLM-Output
Phase 1 only checked user messages before they were sent to the LLM.
Phase 2 adds:
Moderation for responses generated by the LLM
Offensive, harmful, unsafe, or policy-violating bot messages are detected before reaching the user
Examples of what gets flagged:
Hate speech
Self-harm content
Violence-inciting content
Explicit or offensive language
Legal/medical advice violations (configurable)
Let's examine how output moderation functions. Train a bot using words identified as offensive or policy-violating, as shown in the screenshot below.

To understand bot responses and the output moderation process, we'll take a closer look at the way responses are generated and evaluated.

End-to-End Moderation Flow
The moderation pipeline now covers the full chat lifecycle.
User → LLM (Input Moderation)
Validate user message
If flagged → block, warn, or sanitize (based on configuration)
If allowed → pass to LLM
LLM → User (Output Moderation)
Validate AI-generated message
If flagged → block or replace with safe fallback
If allowed → deliver to user
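The two-stage flow above can be sketched as a single chat handler. This is a minimal illustration, not the actual implementation: the `moderate` stub stands in for a call to the real moderation service, and the blocked-message strings are placeholders.

```python
def moderate(text: str) -> bool:
    """Hypothetical moderation check; returns True when content is flagged.

    A real deployment would call a moderation service here; this stub
    flags a fixed word list purely for illustration."""
    blocked_terms = {"hate", "violence"}
    return any(term in text.lower() for term in blocked_terms)

SAFE_FALLBACK = "Sorry, I can't help with that request."

def handle_chat(user_message: str, llm) -> str:
    # User -> LLM: validate the user message before it reaches the model.
    if moderate(user_message):
        return "Your message was blocked by the moderation policy."
    # LLM -> User: validate the generated reply before delivery.
    reply = llm(user_message)
    if moderate(reply):
        return SAFE_FALLBACK  # replace the reply with a safe fallback
    return reply
```

Note that the output check runs on every reply, so even a harmless prompt cannot surface an unsafe generation.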
Moderation Workflow Logging
To ensure every chat request undergoes moderation, we've implemented detailed step-by-step logs. These logs allow for easy tracking of each processing stage, confirming that moderation functions correctly.

The interface allows you to filter activity logs by different criteria. At the top, a date range selector chooses when the activity occurred. Below that are several filter options:
Log Type
Activity Type
Module Name (optional)
Reference Number (optional)
Organization
Employee email
There is also a toggle to show anonymous logs and an option to filter logs by session ID. After choosing the filters, click the Filter button to view the matching activity logs.
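The filtering the UI performs can be pictured as a simple predicate chain over log entries. A minimal sketch, assuming each entry is a dict with `date`, `log_type`, and `session_id` keys; these field names are illustrative.

```python
from datetime import date

def filter_logs(logs, start=None, end=None, log_type=None, session_id=None):
    """Return log entries matching every filter that was supplied.

    Filters left as None are ignored, mirroring optional UI fields."""
    out = []
    for entry in logs:
        if start and entry["date"] < start:
            continue  # before the selected date range
        if end and entry["date"] > end:
            continue  # after the selected date range
        if log_type and entry["log_type"] != log_type:
            continue
        if session_id and entry["session_id"] != session_id:
            continue
        out.append(entry)
    return out
```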
Let's see how flagged content or records look:

When the system detects flagged content, we create a detailed log entry to show exactly what happened during moderation. Since we use Azure Content Safety libraries, the moderation service automatically evaluates the text and assigns a severity score based on the words or sentences that triggered the alert.
Azure returns two key pieces of information:
Severity Score – Indicates how serious the detected issue is.
Category – Specifies the type of content that was flagged (e.g., violence, hate, self‑harm).
In this example, the moderation output showed a severity score of 2 under the Violence category. Along with this, we also capture which part of the content triggered the flag, so it’s clear what caused the alert.
All of this information is recorded as part of the moderation workflow logs. Each step—from the bot’s response to the moderation check and the flagged result—is logged to ensure full transparency and traceability.
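Constructing such a log entry from a moderation result can be sketched as below. The input list mirrors the shape of a content-safety analysis result (a category name plus a severity score per category, as described above); the log-entry field names are illustrative assumptions, not the actual schema.

```python
from datetime import datetime, timezone

def build_moderation_log(text: str, categories_analysis: list) -> dict:
    """Build a workflow log entry from a content-safety analysis result.

    `categories_analysis` is a list of dicts with `category` and
    `severity` keys; any category with severity above zero is treated
    as a flag, and the triggering text is captured for traceability."""
    flagged = [c for c in categories_analysis if c["severity"] > 0]
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "flagged": bool(flagged),
        "triggering_text": text if flagged else None,
        "results": [
            {"category": c["category"], "severity": c["severity"]}
            for c in flagged
        ],
    }
```

For the example in the text, a Violence category with severity 2 would yield a flagged entry recording both the category and the text that triggered it.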
LLM Moderation Check within RAG Training Pipeline
This feature adds a content-safety filtering step to the process of preparing or retrieving data for a Retrieval-Augmented Generation (RAG) model. Before a document is added to a vector store or returned to the LLM during retrieval, the system uses another LLM to check whether the content is safe, allowed, and appropriate.
Moderation can happen at two points in a RAG pipeline:
Ingestion / Training Phase: where documents are collected, cleaned, chunked, and stored in a vector database.
Retrieval + Generation Phase: where the system retrieves relevant chunks and generates an answer.
An LLM Moderation Check is simply a safety filter inserted either before ingestion, before retrieval, or both.
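The two insertion points can be sketched as checks in the ingestion and retrieval paths. This is an illustrative outline only: `moderate_chunk` stands in for the real LLM-based safety check, the vector store is modeled as a plain list, and retrieval is a toy substring match.

```python
def moderate_chunk(chunk: str) -> bool:
    """Hypothetical safety check; True means the chunk is flagged.

    A real pipeline would call an LLM or moderation service here."""
    return "restricted" in chunk.lower()

def ingest(documents, vector_store):
    """Ingestion/training phase: drop flagged chunks before indexing."""
    for chunk in documents:
        if not moderate_chunk(chunk):
            vector_store.append(chunk)

def retrieve(query, vector_store):
    """Retrieval phase: re-check chunks before they reach the LLM."""
    hits = [c for c in vector_store if query.lower() in c.lower()]
    return [c for c in hits if not moderate_chunk(c)]
```

Checking at both points is defense in depth: ingestion-time filtering keeps unsafe material out of the index, while retrieval-time filtering catches anything that slipped through or was indexed before the policy changed.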
How to enable the RAG Training Moderation Studio Mode
To upload RAG files in Studio Mode, refer to this document.
Navigate to the organization-level configuration page for managing LLM Moderation settings. These settings control how and when AI moderation is applied across your bots and RAG (Retrieval-Augmented Generation) pipelines.
Select the LLM Moderation tab; you will see three independent moderation toggles:
User Input Moderation
LLM Output Moderation
RAG Training Moderation
For RAG training, enable the User Input Moderation and RAG Training Moderation toggles.
User Input Moderation: When this is turned ON, every message the user types into any bot will first be checked by the moderation system.
RAG Training Moderation: When this is ON, all uploaded files, documents, and training materials used in RAG pipelines are checked by the moderation model before being ingested.

After enabling the toggle buttons, either create a custom app or choose an existing one. Go to the AI Content tab and add new content using files, URLs, or documents, based on the available options.

There are four tabs, each representing a different method of supplying content:
Upload Files: Allows users to upload files directly from their local device.
Public URLs: Allows users to submit public web links for ingestion.
SharePoint: Allows users to add documents from SharePoint repositories.
Contents: Allows users to add raw text or manual content entries.
Each tab corresponds to a source of data for the AI content pipeline.
Upload Files
By clicking on Choose File, users can upload multiple files at once.
Supported file types: [".pdf", ".doc", ".docx", ".csv", ".xls", ".xlsx", ".txt", ".pptx"]
Maximum file size: 5 MB per file.
Maximum file count: 20 files per upload session.
Note: Only plain text content is supported.
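The upload limits above translate into a simple validation pass before any file is accepted. A minimal sketch; the function name and error strings are illustrative, and each file is modeled as a (filename, size-in-bytes) pair.

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".doc", ".docx", ".csv",
                      ".xls", ".xlsx", ".txt", ".pptx"}
MAX_FILE_SIZE = 5 * 1024 * 1024  # 5 MB per file
MAX_FILE_COUNT = 20              # files per upload session

def validate_upload(files):
    """Check a batch of (filename, size_in_bytes) pairs against the
    documented limits and return a list of error messages."""
    errors = []
    if len(files) > MAX_FILE_COUNT:
        errors.append(f"too many files: {len(files)} > {MAX_FILE_COUNT}")
    for name, size in files:
        ext = os.path.splitext(name)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            errors.append(f"{name}: unsupported file type {ext}")
        if size > MAX_FILE_SIZE:
            errors.append(f"{name}: exceeds the 5 MB limit")
    return errors  # empty list means the upload is accepted
```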

Once the training is done, the list of contents will reflect under the AI Content tab.

Public URLs
Similarly, to add Public URLs,
Navigate to the Public URL tab and add the web URL,

Once the URL is added, it will appear in the list of contents under the AI Content tab.

SharePoint
Navigate to the SharePoint tab and choose the files from the drop-down list,

Once the selected content is added from SharePoint, it will appear in the list of contents under the AI Content tab.

Contents
Navigate to the Contents tab and add the content available from the dropdown.

Once the selected content is added from Contents, it will appear in the list of contents under the AI Content tab.

Now you will see the entire list of files and contents added through all four methods under the AI Content tab.
Files marked as "cancelled" or highlighted in red contain sensitive data. These files are redacted by the LLM system and are kept in a cancelled state to ensure data protection.
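The cancelled state described above can be modeled as a status assigned during ingestion. A minimal sketch, assuming a `contains_sensitive` check supplied by the moderation layer; the status names and record fields are illustrative.

```python
from enum import Enum

class FileStatus(Enum):
    TRAINED = "trained"
    CANCELLED = "cancelled"  # sensitive data detected; file kept but redacted

def ingest_file(name: str, text: str, contains_sensitive) -> dict:
    """Assign a status to an uploaded file during RAG ingestion.

    Files whose content is flagged as sensitive are kept in a
    cancelled state instead of being trained into the knowledge base."""
    if contains_sensitive(text):
        status = FileStatus.CANCELLED
    else:
        status = FileStatus.TRAINED
    return {"file": name, "status": status.value}
```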

For Basic Mode file upload, kindly follow this document.
Even in Basic Mode, you can add contents by navigating to the Data tab and clicking the Training tab.
Once clicked, you will see the options to upload files,

Add the required contents and click on save.

Once saved, you can have a conversation with the bot. The bot will retrieve the information and provide a response. If the document contains sensitive data, PII, toxicity, harmful instructions, NSFW material, or other content that shouldn't enter the knowledge base, the bot will respond with the message: "Your message couldn't be processed because it may contain restricted or sensitive content. Please revise and try again."
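The refusal behavior above can be sketched as a final guard in the answer path. This is an illustrative outline: `retrieve`, `generate`, and `is_restricted` are assumed helper callables, and only the refusal string comes from the product itself.

```python
RESTRICTED_MESSAGE = (
    "Your message couldn't be processed because it may contain restricted "
    "or sensitive content. Please revise and try again."
)

def answer_from_knowledge_base(question, retrieve, generate, is_restricted):
    """Answer a question from the knowledge base, substituting the
    documented refusal message when any retrieved chunk is flagged."""
    chunks = retrieve(question)
    if any(is_restricted(c) for c in chunks):
        return RESTRICTED_MESSAGE  # never expose flagged content to the user
    return generate(question, chunks)
```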
