Document Ingestion Pipeline
How Septum turns an uploaded file into searchable, anonymised content. Every step runs locally — raw PII never leaves the machine.
Steps
- Upload — a file reaches the API. Type is identified from content bytes (
python-magic), never from the extension. A.pdfthat is actually a PNG is routed to the OCR pipeline. - Format-specific ingester — PDF through
pypdfwith per-page extraction; DOCX / XLSX throughpython-docxandopenpyxl; images and scanned PDFs through PaddleOCR; audio through OpenAI Whisper. - Language detection — the extracted text is classified across 20+ languages. The result drives the NER model choice and validator rules used by the next step.
- Three-layer PII detection — three detectors run in series and their outputs are merged:
- Presidio — regex patterns + algorithmic validators (TCKN checksum, Aadhaar Verhoeff, NRIC/FIN, CPF, NINO, CNPJ, My Number, and more).
- NER — XLM-RoBERTa transformer for
PERSON_NAME,LOCATION,ORGANIZATION_NAME. - Ollama — optional semantic layer for context-dependent PII that patterns cannot catch (off by default; set
SEPTUM_USE_OLLAMA=trueto enable). Overlapping spans are merged; coreferences are resolved so "Ahmet" and "Mr. Yılmaz" collapse to the same placeholder index.
- Masking + anonymisation map — every detected entity is replaced with a stable, type-indexed placeholder (
[PERSON_1],[EMAIL_ADDRESS_3]). Theoriginal → placeholdermap is persisted per-document, encrypted at rest, and never leaves the air-gapped zone. - Parallel processing — two tracks run concurrently on the masked output:
- Chunking → Embedding → FAISS + BM25 — the masked text is split semantically (paragraph-aware with overlap), each chunk is embedded with sentence-transformers, and written to both the FAISS vector index and the BM25 keyword index.
- Encrypted storage — the original file is sealed with AES-256-GCM on disk. It is never decrypted outside the air-gapped zone.
- Ready for search — when all three tracks (FAISS, BM25, encrypted blob) complete, the document is flagged
ingestion_status="completed"and becomes queryable from chat.