Architectural changes to the RAG file upload feature

To enhance the RAG (Retrieval-Augmented Generation) file upload feature, the emphasis is on faster, more reliable, and more scalable uploads. "File upload size and limit optimization" refers to the architectural and technical adjustments needed to handle large files efficiently while maintaining performance.

The file ingestion dashboard lists documents uploaded into the RAG (Retrieval‑Augmented Generation) system. It displays columns such as:

  • Created on

  • Location (file path or URL) - an icon is shown next to each file name, helping users distinguish between different types of uploaded files.

  • Status (Completed / Canceled / Failed)

  • Actions (Preview, Refresh, Delete)

This dashboard is typically part of a RAG ingestion pipeline where each uploaded file is processed, chunked, embedded, and stored in a vector database.
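The chunk → embed → store flow above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `embed` here is a stand-in that derives a tiny deterministic vector from a hash, and a plain dictionary stands in for the vector database.

```python
import hashlib

def embed(chunk: str) -> list[float]:
    # Stand-in for a real embedding model: derives a small deterministic
    # vector from a hash of the text (illustration only).
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:4]]

def chunk_text(text: str, size: int = 200) -> list[str]:
    # Split extracted text into fixed-size chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(doc_id: str, text: str, store: dict) -> int:
    # Chunk -> embed -> store, linking per-chunk metadata by key.
    chunks = chunk_text(text)
    for n, chunk in enumerate(chunks):
        store[f"{doc_id}:{n}"] = {"vector": embed(chunk), "text": chunk}
    return len(chunks)

store: dict = {}
ingest("report.pdf", "word " * 100, store)
```

A real pipeline would replace `embed` with a model call and `store` with a vector database client, but the shape of the loop is the same.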

The Need for Architectural Optimization

Architectural optimization is essential due to the computational intensity of RAG ingestion, which includes:

  • Reading the file

  • Extracting text

  • Cleaning and chunking text

  • Generating embeddings

  • Storing vectors

  • Linking metadata

  • Validating structure

Large files or numerous simultaneous uploads can lead to bottlenecks, manifesting in the UI as:

  • Canceled ingestion jobs

  • Stagnant in-progress states

  • Slow processing

  • Errors during chunk generation

  • Incomplete embeddings

Enhancing file size handling, upload limits, and the ingestion backend directly addresses these operational challenges, resulting in a more efficient process.
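One common pattern for the "upload limits" part is to reject oversized files up front and stream the body in small pieces so a single large upload cannot exhaust memory. The limits below (`MAX_UPLOAD_BYTES`, `CHUNK_BYTES`) are illustrative assumptions, not the system's configured values:

```python
from io import BytesIO

# Illustrative limits; real values depend on the deployment.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024   # reject anything above 50 MB
CHUNK_BYTES = 8 * 1024                # stream in 8 KB pieces

def accept_upload(stream, declared_size: int) -> list[bytes]:
    # Reject oversized files before reading, then consume the body in
    # small chunks, re-checking the running total against the limit.
    if declared_size > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds upload limit")
    parts, read = [], 0
    while True:
        piece = stream.read(CHUNK_BYTES)
        if not piece:
            break
        read += len(piece)
        if read > MAX_UPLOAD_BYTES:   # guard against an understated size
            raise ValueError("file exceeds upload limit")
        parts.append(piece)
    return parts
```

Checking the running total, not just the declared size, matters because a client can understate the size header.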

Detailed Explanation of Status Codes in the RAG Training Pipeline

These statuses represent the lifecycle of a file during the ingestion + training pipeline in your RAG (Retrieval-Augmented Generation) system. Each status corresponds to a specific processing stage.

  • REQUESTED: The initial status. It indicates that the user has added content (S3 file, SharePoint file, Web Scraper URL, FAQ entry, etc.) and the system has registered the ingestion request. It is set right after the user clicks Save, before any file extraction or processing starts, ensuring the system logs the request even before workers begin processing it.

  • CONTENTS_EXTRACTED: The system successfully extracted text/content from the uploaded file.

    This step includes:

    • File reading

    • OCR (if needed)

    • HTML scraping

    • Text extraction

    • Cleaning and normalization

    • Chunking preparation

    This status is set after extraction but before chunking, PII redaction, moderation, and vector generation.

Some files may upload successfully but still fail extraction (PDF parsing error, corrupted file, etc.). This status helps identify where failure happens.

  • COMPLETED: The entire ingestion → processing → training pipeline completed successfully.

    • All steps have finished: File uploaded → Contents extracted → Chunks created → PII scanning & redaction → LLM moderation passed → Embeddings generated → Stored in Vector DB → Metadata saved

  • FAILED_TO_EXTRACT_CONTENTS: The system could not extract text from the file.

    • The reasons could be: corrupted PDFs, an unsupported file format, scanned images without OCR, empty files, a timeout during extraction, or permission errors in external sources

  • FAILED_TO_COMPLETE: Content was extracted successfully, but a later stage of the pipeline failed. Failure may occur in: PII redaction, LLM moderation, embedding generation, upload to the Vector DB, DynamoDB logging, or a timeout during the training job

  • CANCELLED: The training or ingestion job was terminated before completion.

    • Manual cancellation: someone clicked Delete or Cancel.

    • System auto‑cancellation: LLM moderation rejection, Violating severity thresholds, Detection of harmful categories, Admin rule triggers
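These statuses form a small state machine: a job can only move to certain statuses from its current one. The sketch below makes that explicit; the transition table is an assumption inferred from the descriptions above, not a documented schema:

```python
from enum import Enum

class IngestionStatus(Enum):
    REQUESTED = "REQUESTED"
    CONTENTS_EXTRACTED = "CONTENTS_EXTRACTED"
    COMPLETED = "COMPLETED"
    FAILED_TO_EXTRACT_CONTENTS = "FAILED_TO_EXTRACT_CONTENTS"
    FAILED_TO_COMPLETE = "FAILED_TO_COMPLETE"
    CANCELLED = "CANCELLED"

# Allowed transitions, mirroring the lifecycle described above
# (assumed here, not taken from the system's actual schema).
TRANSITIONS = {
    IngestionStatus.REQUESTED: {
        IngestionStatus.CONTENTS_EXTRACTED,
        IngestionStatus.FAILED_TO_EXTRACT_CONTENTS,
        IngestionStatus.CANCELLED,
    },
    IngestionStatus.CONTENTS_EXTRACTED: {
        IngestionStatus.COMPLETED,
        IngestionStatus.FAILED_TO_COMPLETE,
        IngestionStatus.CANCELLED,
    },
}

def advance(current: IngestionStatus, new: IngestionStatus) -> IngestionStatus:
    # Move a job to a new status, rejecting illegal jumps
    # (e.g. REQUESTED straight to COMPLETED).
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Encoding the lifecycle this way makes it impossible for a worker bug to mark a job COMPLETED before its contents were ever extracted.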
