# Architectural changes to the RAG file upload feature

To enhance the **RAG (Retrieval-Augmented Generation) file upload feature**, the emphasis is on achieving **faster, more reliable, and scalable** uploads. "File upload size and limit optimization" involves architectural and technical adjustments to efficiently handle large files while maintaining performance.

<figure><img src="https://1107164708-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M8XHvUsfyTUFLvToHqD%2Fuploads%2FDFjdXp84ffgPBiJpGOuR%2Fimage.png?alt=media&#x26;token=803340ed-cb52-40dc-9672-423301ab2228" alt=""><figcaption></figcaption></figure>

The file ingestion dashboard lists the documents uploaded into a RAG (Retrieval‑Augmented Generation) system. It displays columns such as:

* Created on
* Location (file path or URL): an icon next to the file name helps users distinguish between the different types of uploaded files.
* Status (Completed / Canceled / Failed)
* Actions (Preview, Refresh, Delete)

This dashboard is typically part of a RAG ingestion pipeline where each uploaded file is processed, chunked, embedded, and stored in a vector database.

#### The Need for Architectural Optimization

Architectural optimization is essential due to the computational intensity of RAG ingestion, which includes:

* Reading the file
* Extracting text
* Cleaning and chunking text
* Generating embeddings
* Storing vectors
* Linking metadata
* Validating structure
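
The steps above can be sketched end to end in a few lines. This is a minimal illustration only, not the product's implementation: `InMemoryVectorStore`, `toy_embed`, and `ingest` are hypothetical names, the "embedding" is a hash-derived placeholder rather than a real model, and chunking uses fixed-size character windows for simplicity.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class VectorRecord:
    chunk: str
    vector: list[float]
    metadata: dict


@dataclass
class InMemoryVectorStore:
    """Hypothetical stand-in for a real vector database."""
    records: list[VectorRecord] = field(default_factory=list)

    def add(self, record: VectorRecord) -> None:
        self.records.append(record)


def clean(text: str) -> str:
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, size: int = 40) -> list[str]:
    # Fixed-size character windows; real pipelines usually split on
    # sentence or token boundaries instead.
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(chunk_text: str, dims: int = 4) -> list[float]:
    # Placeholder only: derives floats from a SHA-256 digest.
    # A real system would call an embedding model here.
    digest = hashlib.sha256(chunk_text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def ingest(raw: bytes, source: str, store: InMemoryVectorStore) -> int:
    text = clean(raw.decode("utf-8"))          # read, extract, clean
    if not text:                               # validate
        raise ValueError(f"no extractable text in {source}")
    chunks = chunk(text)                       # chunk
    for index, piece in enumerate(chunks):
        metadata = {"source": source, "chunk_index": index}  # link metadata
        store.add(VectorRecord(piece, toy_embed(piece), metadata))  # embed, store
    return len(chunks)


store = InMemoryVectorStore()
count = ingest(
    b"RAG ingestion  reads, cleans,\nchunks, embeds, and stores each uploaded file.",
    "example.txt",
    store,
)
```

Every stage runs once per chunk, so the cost grows with file size, which is why large files put pressure on the ingestion backend.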

Large files or numerous simultaneous uploads can lead to **bottlenecks**, manifesting in the UI as:

* Canceled ingestion jobs
* Stagnant in-progress states
* Slow processing
* Errors during chunk generation
* Incomplete embeddings

Enhancing **file size handling, upload limits, and the ingestion backend** directly addresses these operational challenges, resulting in a more efficient process.
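
One way to picture upload limits and backend protection is to reject oversized files up front and cap how many ingestion jobs run at once. The sketch below is an assumption-laden illustration: `MAX_FILE_BYTES`, `MAX_CONCURRENT_JOBS`, `check_size`, and `ingest_all` are hypothetical names and values, not settings taken from the product, and `asyncio.sleep` stands in for the real processing work.

```python
import asyncio

MAX_FILE_BYTES = 50 * 1024 * 1024   # assumed limit, not a documented product value
MAX_CONCURRENT_JOBS = 2             # assumed worker capacity


def check_size(name: str, size_bytes: int) -> None:
    # Reject oversized files before any processing starts.
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError(f"{name} exceeds the {MAX_FILE_BYTES}-byte upload limit")


async def run_job(name: str, limiter: asyncio.Semaphore,
                  active: list[int], peak: list[int]) -> str:
    async with limiter:                      # wait for a free slot
        active[0] += 1
        peak[0] = max(peak[0], active[0])    # track the highest concurrency seen
        await asyncio.sleep(0.01)            # stands in for extraction, chunking, embedding
        active[0] -= 1
    return name


async def ingest_all(files: dict[str, int]) -> tuple[list[str], int]:
    for name, size in files.items():
        check_size(name, size)
    limiter = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
    active, peak = [0], [0]
    done = await asyncio.gather(
        *(run_job(name, limiter, active, peak) for name in files)
    )
    return list(done), peak[0]


finished, peak_jobs = asyncio.run(
    ingest_all({"a.pdf": 1_000, "b.docx": 2_000, "c.txt": 3_000})
)
```

With three files and a cap of two, the third job waits for a free slot instead of competing for resources, which is the behavior that helps avoid stalled or canceled jobs under load.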

#### Detailed Explanation of Status Codes in the RAG Training Pipeline

These statuses represent the **lifecycle of a file during the ingestion and training pipeline** in the RAG (Retrieval-Augmented Generation) system. Each status corresponds to a specific processing stage.

<figure><img src="https://1107164708-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M8XHvUsfyTUFLvToHqD%2Fuploads%2FM4djFmMgSvAFdmYh8UVF%2Fimage.png?alt=media&#x26;token=5f403277-f1e0-414c-8726-ec1002486428" alt=""><figcaption></figcaption></figure>

* REQUESTED: This is the **initial** status. It indicates that the user has added content (S3 file, SharePoint file, Web Scraper URL, FAQ entry, etc.) and the system has registered the ingestion request. It is set right after the user hits **Save**, before any file extraction or processing starts, which ensures that the system logs the request even before workers begin processing it.
* CONTENTS\_EXTRACTED: The system successfully **extracted text/content** from the uploaded file.

  This step includes:

  * File reading
  * OCR (if needed)
  * HTML scraping
  * Text extraction
  * Cleaning and normalization
  * Chunking preparation

  This status is set after extraction but before chunking, PII redaction, moderation, and vector generation.

  Some files may upload successfully but still fail extraction (PDF parsing error, corrupted file, etc.). This status helps identify where a failure happens.

* COMPLETED: The entire ingestion → processing → training pipeline completed successfully.
  * All steps have finished: File uploaded → Contents extracted → Chunks created → PII scanning and redaction → LLM moderation passed → Embeddings generated → Stored in Vector DB → Metadata saved
* FAILED\_TO\_EXTRACT\_CONTENTS: The system **could not extract text** from the file.
  * Possible reasons: corrupted PDFs, unsupported file formats, scanned images without OCR, empty files, timeouts during extraction, or permission errors in external sources.
* FAILED\_TO\_COMPLETE: Content was extracted successfully, but **processing later in the pipeline failed**.
  * Failure may occur in: PII redaction, LLM moderation, embedding generation, upload to the Vector DB, DynamoDB logging, or a timeout during the training job.
* CANCELLED: The training or ingestion job was **terminated** before completion.
  * User action: someone clicked *Delete* or *Cancel*.
  * System auto‑cancellation: LLM moderation rejection, violation of severity thresholds, detection of harmful categories, or admin rule triggers.
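
The lifecycle described above can be modeled as a small state machine. The status names come from this page, but the transition table is an assumption inferred from the descriptions, and `IngestionStatus`, `ALLOWED`, and `advance` are hypothetical names. In particular, the sketch treats failed, cancelled, and completed states as terminal and does not model the dashboard's Refresh action.

```python
from enum import Enum


class IngestionStatus(str, Enum):
    REQUESTED = "REQUESTED"
    CONTENTS_EXTRACTED = "CONTENTS_EXTRACTED"
    COMPLETED = "COMPLETED"
    FAILED_TO_EXTRACT_CONTENTS = "FAILED_TO_EXTRACT_CONTENTS"
    FAILED_TO_COMPLETE = "FAILED_TO_COMPLETE"
    CANCELLED = "CANCELLED"


S = IngestionStatus

# Assumed transitions, inferred from the status descriptions.
# Statuses missing from this table are treated as terminal.
ALLOWED: dict[IngestionStatus, set[IngestionStatus]] = {
    S.REQUESTED: {S.CONTENTS_EXTRACTED, S.FAILED_TO_EXTRACT_CONTENTS, S.CANCELLED},
    S.CONTENTS_EXTRACTED: {S.COMPLETED, S.FAILED_TO_COMPLETE, S.CANCELLED},
}


def is_terminal(status: IngestionStatus) -> bool:
    return status not in ALLOWED


def advance(current: IngestionStatus, new: IngestionStatus) -> IngestionStatus:
    # Refuse any transition that the table does not list.
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


# Happy path for a single file.
status = S.REQUESTED
status = advance(status, S.CONTENTS_EXTRACTED)
status = advance(status, S.COMPLETED)
```

Keeping the extraction failure separate from the later-stage failure, as the statuses do, makes it possible to tell from the status alone which half of the pipeline needs attention.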
