# Architectural changes to the RAG file upload feature

The architectural changes to the **RAG (Retrieval-Augmented Generation) file upload feature** aim to make uploads **faster, more reliable, and more scalable**. "File upload size and limit optimization" refers to the architectural and technical adjustments that let the system handle large files efficiently without degrading performance.

<figure><img src="/files/JgQFBkuhkEVWfWoQT3RC" alt=""><figcaption></figcaption></figure>

The file ingestion dashboard lists the documents uploaded into a RAG (Retrieval-Augmented Generation) system. It displays columns such as:

* Created on
* Location (file path or URL): an icon next to the file name helps users distinguish between the different types of uploaded files.
* Status (Completed / Canceled / Failed)
* Actions (Preview, Refresh, Delete)

This dashboard is typically part of a RAG ingestion pipeline where each uploaded file is processed, chunked, embedded, and stored in a vector database.

#### The Need for Architectural Optimization

Architectural optimization is essential due to the computational intensity of RAG ingestion, which includes:

* Reading the file
* Extracting text
* Cleaning and chunking text
* Generating embeddings
* Storing vectors
* Linking metadata
* Validating structure
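
As an illustration of the "cleaning and chunking" step in the list above, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text`, the character-based window, and the default sizes are assumptions made for this example; the actual pipeline may chunk by tokens, sentences, or document structure.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    # Cleaning: collapse runs of whitespace so chunk boundaries are predictable.
    cleaned = " ".join(text.split())

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: list[str] = []
    for start in range(0, len(cleaned), step):
        chunks.append(cleaned[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(cleaned):
            break
    return chunks


if __name__ == "__main__":
    sample = "word " * 120
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

The overlap keeps some shared context between neighboring chunks, at the cost of generating and storing more embeddings per file, which is one reason large files are expensive to ingest.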

Large files or numerous simultaneous uploads can lead to **bottlenecks**, manifesting in the UI as:

* Canceled ingestion jobs
* Jobs stuck in an in-progress state
* Slow processing
* Errors during chunk generation
* Incomplete embeddings

Enhancing **file size handling, upload limits, and the ingestion backend** directly addresses these operational challenges, resulting in a more efficient process.
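
One way to enforce file size and upload limits is to validate each request before it enters the ingestion queue. The sketch below is hypothetical: `UploadLimits`, `check_upload`, and the numeric limits are placeholders chosen for illustration, not the product's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadLimits:
    # Hypothetical values for illustration, not the product's real limits.
    max_file_bytes: int = 50 * 1024 * 1024
    max_concurrent_uploads: int = 5


def check_upload(
    size_bytes: int,
    active_uploads: int,
    limits: UploadLimits = UploadLimits(),
) -> list[str]:
    """Return the reasons to reject an upload; an empty list means accepted."""
    problems: list[str] = []
    if size_bytes <= 0:
        problems.append("file is empty")
    if size_bytes > limits.max_file_bytes:
        problems.append(f"file exceeds {limits.max_file_bytes} bytes")
    if active_uploads >= limits.max_concurrent_uploads:
        problems.append("too many uploads already in progress")
    return problems


if __name__ == "__main__":
    print(check_upload(size_bytes=10 * 1024 * 1024, active_uploads=2))
    print(check_upload(size_bytes=80 * 1024 * 1024, active_uploads=5))
```

Rejecting oversized or excess uploads up front is cheaper than letting them fail or stall partway through extraction or embedding.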

#### Detailed Explanation of Status Codes in the RAG Training Pipeline

These statuses represent the **lifecycle of a file during the ingestion + training pipeline** in your RAG (Retrieval-Augmented Generation) system. Each status corresponds to a specific processing stage.

<figure><img src="/files/SOqIL2kCTduyr1Qu9qgv" alt=""><figcaption></figcaption></figure>

* REQUESTED: This is the **initial** status. It indicates that the user has added content (S3 file, SharePoint file, Web Scraper URL, FAQ entry, etc.) and the system has registered the ingestion request. The status is set right after the user hits **Save**, before any file extraction or processing starts. It ensures that the system logs the request even before workers begin processing it.
* CONTENTS\_EXTRACTED: The system successfully **extracted text/content** from the uploaded file.

  This step includes:

  * File reading
  * OCR (if needed)
  * HTML scraping
  * Text extraction
  * Cleaning and normalization
  * Chunking preparation

  The status is set after extraction finishes but before chunking, PII redaction, moderation, and vector generation.

  Some files may upload successfully but still fail extraction (PDF parsing error, corrupted file, etc.), so this status helps identify where a failure happens.

* COMPLETED: The entire ingestion → processing → training pipeline completed successfully.
  * All steps have finished: File uploaded → Contents extracted → Chunks created → PII scanning & redaction → LLM moderation passed → Embeddings generated → Stored in Vector DB → Metadata saved
* FAILED\_TO\_EXTRACT\_CONTENTS: The system **could not extract text** from the file.
  * Possible reasons: corrupted PDFs, unsupported file formats, scanned images without OCR, empty files, timeouts during extraction, and permission errors in external sources.
* FAILED\_TO\_COMPLETE: Content was extracted successfully, but **processing later in the pipeline failed**.
  * Failure may occur in: PII redaction, LLM moderation, embedding generation, upload to the Vector DB, DynamoDB logging, or a timeout during the training job.
* CANCELLED: The training or ingestion job was **terminated** before completion.
  * User action: someone clicked *Delete* or *Cancel*.
  * System auto-cancellation: LLM moderation rejection, violation of severity thresholds, detection of harmful categories, or admin rule triggers.
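
The status lifecycle described above can be sketched as a small state machine. The status names come from this page; the transition table, `can_transition`, and `run_ingestion` are assumptions inferred from the descriptions, not the system's actual implementation. Cancellation is modeled only as an allowed transition, not as a code path.

```python
from enum import Enum
from typing import Callable


class IngestionStatus(str, Enum):
    REQUESTED = "REQUESTED"
    CONTENTS_EXTRACTED = "CONTENTS_EXTRACTED"
    COMPLETED = "COMPLETED"
    FAILED_TO_EXTRACT_CONTENTS = "FAILED_TO_EXTRACT_CONTENTS"
    FAILED_TO_COMPLETE = "FAILED_TO_COMPLETE"
    CANCELLED = "CANCELLED"


S = IngestionStatus

# Assumed transitions; statuses missing from this table are treated as terminal.
ALLOWED: dict[IngestionStatus, set[IngestionStatus]] = {
    S.REQUESTED: {S.CONTENTS_EXTRACTED, S.FAILED_TO_EXTRACT_CONTENTS, S.CANCELLED},
    S.CONTENTS_EXTRACTED: {S.COMPLETED, S.FAILED_TO_COMPLETE, S.CANCELLED},
}


def can_transition(current: IngestionStatus, new: IngestionStatus) -> bool:
    """Check whether moving from `current` to `new` is allowed in this sketch."""
    return new in ALLOWED.get(current, set())


def run_ingestion(
    extract: Callable[[], str],
    process: Callable[[str], None],
) -> list[IngestionStatus]:
    """Run two hypothetical stages and return the status history."""
    history = [S.REQUESTED]
    try:
        text = extract()  # file reading, OCR, scraping, cleaning
    except Exception:
        history.append(S.FAILED_TO_EXTRACT_CONTENTS)
        return history
    history.append(S.CONTENTS_EXTRACTED)
    try:
        process(text)  # chunking, PII redaction, moderation, embeddings, storage
    except Exception:
        history.append(S.FAILED_TO_COMPLETE)
        return history
    history.append(S.COMPLETED)
    return history


if __name__ == "__main__":
    statuses = run_ingestion(lambda: "some text", lambda text: None)
    print([status.value for status in statuses])
```

Splitting failures into an extraction status and a later-stage status mirrors the point made above: the recorded status tells you which half of the pipeline to investigate.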

