Architectural changes to the RAG file upload feature
To enhance the RAG (Retrieval-Augmented Generation) file upload feature, the emphasis is on faster, more reliable, and scalable uploads. "File upload size and limit optimization" involves architectural and technical adjustments that let the system handle large files efficiently while maintaining performance.

The file ingestion dashboard lists documents uploaded into the RAG (Retrieval‑Augmented Generation) system. It displays columns such as:
Created on
Location (file path or URL), shown with an icon next to the file name to help users distinguish between different types of uploaded files
Status (Completed / Canceled / Failed)
Actions (Preview, Refresh, Delete)
This dashboard is typically part of a RAG ingestion pipeline where each uploaded file is processed, chunked, embedded, and stored in a vector database.
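As a sketch, each dashboard row can be represented as a simple record. The field names below are illustrative assumptions, not the system's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative record for one dashboard row; field names are assumptions,
# not the real schema of the ingestion system.
@dataclass
class IngestedDocument:
    name: str
    location: str          # file path or URL
    created_on: datetime
    status: str            # e.g. "COMPLETED", "CANCELLED", "FAILED_TO_COMPLETE"

doc = IngestedDocument(
    name="handbook.pdf",
    location="s3://docs-bucket/handbook.pdf",
    created_on=datetime.now(timezone.utc),
    status="COMPLETED",
)
```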
The Need for Architectural Optimization
Architectural optimization is essential due to the computational intensity of RAG ingestion, which includes:
Reading the file
Extracting text
Cleaning and chunking text
Generating embeddings
Storing vectors
Linking metadata
Validating structure
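The stages above can be sketched as a linear pipeline. Every function body here is a stand-in for illustration, not the real implementation:

```python
# Minimal sketch of the ingestion stages as a linear pipeline.
# All stage logic is an illustrative stand-in, not the real implementation.
def ingest(raw_bytes: bytes) -> dict:
    text = raw_bytes.decode("utf-8", errors="ignore")    # read file / extract text
    cleaned = " ".join(text.split())                     # clean and normalize
    chunks = [cleaned[i:i + 500] for i in range(0, len(cleaned), 500)]  # chunk
    vectors = [[float(len(c))] for c in chunks]          # stand-in "embeddings"
    return {
        "chunks": chunks,
        "vectors": vectors,                              # store vectors
        "metadata": {"chunk_count": len(chunks)},        # link metadata
    }

result = ingest(b"Example document contents for the RAG pipeline.")
```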
Large files or numerous simultaneous uploads can lead to bottlenecks, manifesting in the UI as:
Canceled ingestion jobs
Stalled in-progress states
Slow processing
Errors during chunk generation
Incomplete embeddings
Enhancing file size handling, upload limits, and the ingestion backend directly addresses these operational challenges, resulting in a more efficient process.
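One common mitigation is to bound the number of concurrent ingestion jobs and enforce a per-file size limit before any work begins. A minimal sketch, where both limits are illustrative values rather than the system's actual configuration:

```python
import asyncio

MAX_FILE_BYTES = 50 * 1024 * 1024   # illustrative 50 MB limit
MAX_CONCURRENT_JOBS = 4             # illustrative concurrency cap

semaphore = asyncio.Semaphore(MAX_CONCURRENT_JOBS)

async def process_upload(name: str, size_bytes: int) -> str:
    # Reject oversized files up front, before consuming worker capacity.
    if size_bytes > MAX_FILE_BYTES:
        return f"{name}: rejected (exceeds size limit)"
    async with semaphore:           # at most MAX_CONCURRENT_JOBS run at once
        await asyncio.sleep(0)      # stand-in for extraction/embedding work
        return f"{name}: completed"

async def main():
    uploads = [("a.pdf", 1_000), ("b.pdf", 60 * 1024 * 1024)]
    return await asyncio.gather(*(process_upload(n, s) for n, s in uploads))

results = asyncio.run(main())
```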
Detailed Explanation of Status Codes in the RAG Training Pipeline
These statuses represent the lifecycle of a file during the ingestion and training pipeline in the RAG (Retrieval-Augmented Generation) system. Each status corresponds to a specific processing stage.
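The statuses described in this section can be modeled as an enum. This is a sketch mirroring the names used in this document, not the system's actual type definition:

```python
from enum import Enum

# Sketch of the lifecycle statuses; values mirror the names in this document.
class IngestionStatus(Enum):
    REQUESTED = "REQUESTED"
    CONTENTS_EXTRACTED = "CONTENTS_EXTRACTED"
    COMPLETED = "COMPLETED"
    FAILED_TO_EXTRACT_CONTENTS = "FAILED_TO_EXTRACT_CONTENTS"
    FAILED_TO_COMPLETE = "FAILED_TO_COMPLETE"
    CANCELLED = "CANCELLED"
```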

REQUESTED: This is the initial status. It indicates that the user has added content (S3 file, SharePoint file, Web Scraper URL, FAQ entry, etc.) and the system has registered the ingestion request. It is set right after the user hits Save, before any file extraction or processing starts, ensuring the system logs the request even before workers begin processing it.
CONTENTS_EXTRACTED: The system successfully extracted text/content from the uploaded file.
This step includes:
File reading
OCR (if needed)
HTML scraping
Text extraction
Cleaning and normalization
Chunking preparation
This status is set after extraction but before chunking, PII redaction, moderation, and vector generation.
Some files may upload successfully but still fail extraction (PDF parsing error, corrupted file, etc.). This status helps identify where failure happens.
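A sketch of how the extraction step might resolve to CONTENTS_EXTRACTED or FAILED_TO_EXTRACT_CONTENTS. The status strings mirror this document, but the parsing logic is a stand-in for real PDF/OCR/HTML extractors:

```python
def extract_contents(raw: bytes) -> tuple[str, str]:
    """Return (status, text). The decode call is a stand-in for real extraction."""
    try:
        text = raw.decode("utf-8")          # stand-in for PDF/OCR/HTML extraction
        if not text.strip():
            raise ValueError("empty file")
        return "CONTENTS_EXTRACTED", " ".join(text.split())
    except (UnicodeDecodeError, ValueError):
        # Corrupted, unsupported, or empty input
        return "FAILED_TO_EXTRACT_CONTENTS", ""

ok_status, _ = extract_contents(b"Some readable text")
bad_status, _ = extract_contents(b"\xff\xfe\x00corrupt")
```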
COMPLETED: The entire ingestion → processing → training pipeline completed successfully.
All steps have finished: File uploaded → Contents extracted → Chunks created → PII scanning & redaction → LLM moderation passed → Embeddings generated → Stored in Vector DB → Metadata saved
FAILED_TO_EXTRACT_CONTENTS: The system could not extract text from the file.
Possible reasons include: corrupted PDFs, unsupported file formats, scanned images without OCR, empty files, timeouts during extraction, or permission errors in external sources
FAILED_TO_COMPLETE: Content was extracted successfully, but processing later in the pipeline failed. Failure may occur in: PII redaction, LLM moderation, embedding generation, upload to the Vector DB, DynamoDB logging, or a timeout during the training job
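Downstream failures of this kind can be captured by wrapping the post-extraction stages in one error boundary. In this sketch each stage is a hypothetical callable, and the simulated timeout is illustrative:

```python
def run_post_extraction(text: str, stages) -> str:
    """Run post-extraction stages in order; any exception yields FAILED_TO_COMPLETE."""
    try:
        for stage in stages:
            text = stage(text)
        return "COMPLETED"
    except Exception:
        return "FAILED_TO_COMPLETE"

# Stand-in stages: redact PII, moderate, embed (all illustrative).
redact = lambda t: t.replace("555-0100", "[REDACTED]")
moderate = lambda t: t
def embed(t):
    raise TimeoutError("training job timed out")   # simulated downstream failure

ok = run_post_extraction("call 555-0100", [redact, moderate])
failed = run_post_extraction("call 555-0100", [redact, moderate, embed])
```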
CANCELLED: The training or ingestion job was terminated before completion.
Reasons include: a user clicked Delete or Cancel, or the system auto‑cancelled the job (LLM moderation rejection, severity‑threshold violations, detection of harmful categories, admin rule triggers)
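Cancellation is typically implemented as a flag checked between pipeline stages, so a job stops at the next stage boundary rather than mid-work. A minimal illustrative sketch:

```python
class IngestionJob:
    """Illustrative job that can be cancelled between pipeline stages."""
    def __init__(self):
        self.cancelled = False
        self.status = "REQUESTED"

    def cancel(self, reason: str):
        # Set by a user Delete/Cancel action or a system auto-cancellation rule.
        self.cancelled = True
        self.cancel_reason = reason

    def run(self, stages):
        for stage in stages:
            if self.cancelled:          # checked at each stage boundary
                self.status = "CANCELLED"
                return self.status
            stage()
        self.status = "COMPLETED"
        return self.status

job = IngestionJob()
job.cancel("LLM moderation rejection")
final = job.run([lambda: None])
```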